PREFACE TO THE REVISED EDITION

During the fourteen years that have elapsed since the first edi- 
tion of this book was published there has been a very considerable 
extension of the use of statistical methods in business, in public
administration, and in all the social sciences. The pressing
requirements of new tasks and new problems, together with increasing
knowledge of statistical procedures on the part of administrative
and research workers, have contributed to this extension. With
this development, the older controversies over qualitative versus 
quantitative methods have largely been shelved. It is clear that 
different problems call for different procedures; that the men who 
are grappling with research problems differ as regards the methods
of analysis they find congenial and fruitful; that induction and 
deduction are complementary phases of the processes that lead to 
scientific advance. The choice of research procedures does not 
necessitate the acceptance of one method and the rejection of an- 
other; it calls for the finding of a blend of methods that is adapted 
to a particular set of problems, and that is suited to the tempera- 
ment and abilities of the human agent that employs them. For 
workers dealing with social and economic relations, statistical 
methods constitute an essential element of this blend. Knowledge 
of systematic procedures for handling quantitative data, and skill 
in their use, are necessary parts of the equipment of students of the 
social sciences and of public and private administrators who must
utilize the facts of experience in the formulation of policies. 

Gains on this front have been paralleled by notable improve- 
ments in statistical techniques. The post-war years have witnessed, 
in this field, the initiation of such another period of intellectual 
ferment and creative activity as that which, earlier, brought the 
great contributions of Karl Pearson and his associates. The older 
instruments of quantitative analysis have been refined and sharp- 
ened; methods of designing statistical experiments and formulat- 
ing and testing hypotheses have been improved; statistical infer- 
ence has been placed on a sounder foundation. There can be no 
doubt that these continuing improvements in the logic and in the 
technique of statistics will contribute in important ways to the 
advance of the social sciences and to the betterment of public and 
private administration.


In preparing the present edition of Statistical Methods account
has been taken of the more important of the recent developments 
that have a bearing on the economic and business applications of 
statistics. In doing this I have sought to retain the main features 
of the first edition. A systematic development of the fundamentals
of statistical method is needed by the beginning student. A working
compendium of procedures, with necessary aids to calculation
and reference tables, is required by the statistician engaged in ad- 
ministration or research. The book is designed to meet these two 
needs. 

The eighteen chapters of the present edition fall into two main 
divisions. The first twelve chapters deal with the descriptive
aspects of statistics. Induction and sampling are purposely omitted
in this development of basic descriptive procedures. Problems of 
statistical inference, with certain more advanced aspects of statis- 
tical description, are discussed in the last six chapters, and in 
appendices A to E. This organization is, I think, well adapted to 
the needs of instruction. Some teachers may, indeed, prefer to 
introduce at an earlier point the concepts of samples and parent 
populations and the treatment of sampling errors. If so, selected 
pages from the chapter on elementary probabilities and the normal 
curve (Chapter XIII) and from the introductory chapter on induc- 
tion (Chapter XIV) may follow Chapter V in the sequence of 
study. 

In the chapters added to this edition I have sought to exemplify 
economic applications of the newer methods of analysis. These 
methods offer rich and, as yet, largely unexplored possibilities to 
research workers in the social sciences. In these sections I have 
drawn heavily on the path-breaking work of R. A. Fisher. I am
indebted to Dr. Fisher and his publishers, Oliver and Boyd of 
Edinburgh, for permission to include in this book the tabulations 
that appear in certain of the Appendix Tables. These, with the
other tables included, are designed to make the present book a 
reasonably complete working manual adapted to the needs of both 
laboratory worker and student. 

I must reaffirm my thanks to those who assisted me in various 
ways in the preparation of the first edition. I am indebted, in
addition, to Jacob M. Gould, Agnes B. Omundson, and William H.
Mills for valuable aid in the details of the revision. 


May, 1938. 


F. C. M.



PREFACE TO THE FIRST EDITION 

The last decade has witnessed a remarkable stimulation of
interest in quantitative methods in business and in the social 
sciences. The day when intuition was the chief basis of business 
judgment and unsupported hypothesis the mode in social studies 
seems to have passed. Following the lead of workers in the older 
and traditionally more accurate physical sciences, social scientists 
and serious students of business are employing in greater measure 
than ever before a method of study based upon the observation 
and analysis of facts. When these observations are quantitative 
in character appropriate methods are necessary for their
organization and interpretation. This book deals with methods of
combining and analyzing such observations, with primary emphasis
upon materials drawn from the fields of economics and business.

The justification for limiting the treatment to these particular 
fields is two-fold. Although general statistical methods are
practically universal in their application, special problems are
encountered in every field of study. This is particularly true in the
realm of economics, which presents many distinctive difficulties
and many characteristic problems. Methods that are in some
degree specialized to meet these particular requirements have
been developed, and these methods call for treatment in a work
that is restricted in scope. In the second place, methods can
be most effectively explained in terms of particular subjects;
abstract methodology is barren of interest to the average person.
For these reasons the book has been written with reference to the 
specific needs of quantitative workers in economics and business. 

In the explanation of methods no attempt has been made to 
secure the brevity of exposition which may be desirable in a 
strictly mathematical work. The purpose throughout has been 
to write for the learner not for the finished master, and the expla- 
nations have been prepared with the needs of the former in mind. 
I have felt free to omit certain detailed demonstrations of theorems 
because this book is presented as an introduction to the subject, 
not as an exhaustive treatise. 

The methods of quantitative analysis that are in general use 
today represent a long accretion, an accumulation of contribu- 
tions from workers in many fields. It would be vain to attempt to 
enumerate all the individuals who have contributed to the
development of the science of statistics. Individual references are given
in particular cases in the body of the text, but no list of such
acknowledgments can serve as a complete record of the debt modern
statisticians owe to their predecessors.

For assistance in the preparation of the material contained in
this book I am under many obligations. To Mr. H. E. pderson 
and Professor H. B. Killough I am indebted for certain of the data
employed in Chapters XI, XVI, and XVII. Professor Warren M. 
Persons of the Harvard Committee on Economic Research has 
courteously permitted me to make use of certain results of his work
on commodity price index numbers. The index of industrial activ- 
ity presented in Chapter IX and utilized in Chapter XI is a product 
of the Statistical Division of the American Telephone and
Telegraph Company. I have employed it with the permission of
Mr. Seymour L. Andrew, Chief Statistician. Suggestions from
Professor A. H. Mowbray of the University of California have
enabled me to remove several obscurities that were present in an
earlier mimeographed edition. I am deeply grateful to Professors
Henry L. Moore, Theodore H. Brown, and Henry Schultz for their
help in critically reviewing portions of the manuscript. For
assistance at every stage of the work involved in the writing of this book
I am under deep obligation to Professor Donald H. Davenport.
His aid in the collection of material, in the preparation of charts,
and in the onerous task of seeing the book through the press has
been invaluable. To my wife, above all others, I am indebted
for a measure of constant and generous help that cannot be
adequately acknowledged here.

F. C. M.


CHAPTER I


STATISTICAL METHODS AND THE PROBLEMS 
OF ECONOMICS AND BUSINESS 

The distinction between economics and business rests upon
viewpoint and approach, rather than subject matter. The 
economist and the business man have different objectives, 
but the substance of the science of economics and the mate- 
rials with which the art of business administration deals 
are in large part the same. In this treatise we are con- 
cerned with methods that may be employed in handling this 
common subject matter. 

Classes of Business Activity

The tasks that confront business men may, without undue 
straining, be placed in three classes. First, in logical se- 
quence, are the technical tasks that arise in the processes 
of production, involving problems of chemistry and physics, 
of engineering, of animal husbandry, of navigation. The 
basic technical knowledge called for in the solution of these 
problems furnishes the foundation of our economic life. 
This is the domain of the hard-won arts of handling the 
raw materials and controlling the forces of nature. 

In the second class come activities that are connected 
with the internal organization and administration of individual
business units. The technical functions of manipulating
organic and inorganic matter for the satisfaction of
human wants are performed through administrative units, 
single farms, mines, factories, railroads, department stores. 
A whole new division of problems is faced by the business 
man in organizing these units, in coordinating the work of 
the different departments, in supervising the daily activities 
of the individuals making up each organization. While these
are perhaps less fundamental than the technical problems of 
production, they are, for the average business man, more 
pressing and more difficult. Scientific method has made 
less progress in solving these latter problems. There is not 
the organized body of knowledge which is found in the 
former field, nor are there the same trained experts to whom 
the tasks may be delegated. 

The two types of economic activity named above include
tasks that are in a sense self-centered and controllable. The 
manufacturer of steel has his technical problems of smelt- 
ing and refining, his particular administrative duties. The 
farmer or mine-owner faces the same types of problems, in 
forms peculiar to his own situation. In the performance of 
tasks in these fields each man is dealing with problems all 
the elements of which are under more or less perfect control. 
Difficulties arise, but these are ordinarily difficulties inherent 
in the given task, not difficulties arising from a sudden change 
in the constituent elements of the problem, or the sudden 
interjection of a new factor. In this respect the third cate- 
gory of tasks to be performed by the business man differs 
materially from the first two. For this class is composed of 
problems the elements of which are subject only in part 
to control by the individuals directly concerned. 

This third division includes buying and selling, and all
the attendant activities that are carried on in terms of 
prices. As economic life is at present organized these func- 
tions are, to the business man, the most important ones he 
performs. The technical tasks of production and of internal 
organization and administration are but means to an end. 
For the business man the goal of economic activity is the 
disposal of his product at a profit. The tasks preliminary to 
this final sale are of necessity subordinated to it, and so 
performed that the final aim may be achieved. The point 
of emphasis here is that the business man, in buying and 
selling, faces problems containing elements which he cannot 
control. In securing his raw material, in bringing together
the other agents needed in production, and in the final
disposal of his product, the business man deals with markets
(commodity markets, labor markets, money markets) and
finds himself acting in relation to a system of prices quite
beyond his control in its major movements. The other
less fundamental phases of his activity are subject to a high
degree of control, but when the business man comes to the 
final and most important act, the profitable sale of his prod- 
uct, his power of control dwindles. The motivating force in 
business activity is the hope of pecuniary profits, pecuniary 
profits depend upon successful buying and selling, successful 
buying and selling depend upon favorable conditions in an 
uncontrollable world of prices — here is the argument that 
states the major problem of business. And these are the 
facts which make the price system the dominating and 
all-important factor in modern business life. 

The modern entrepreneur lives in an environment of 
prices. The term “environment” is not an unapt figure; 
this world of prices in which the business man functions 
constitutes a coherent, consistent, well-articulated system 
of interdependent parts, a system which encompasses all 
the business activities of the entrepreneur. Since the system 
is beyond the control of the individual he must adapt him- 
self to it, and must base his activities upon as complete an 
understanding of the system as he may obtain. Without 
this understanding the major problems of business are in- 
capable of solution. 

Quantitative Character of Economic and Business
Problems

Problems falling in the first of the classes outlined above 
have long been recognized as essentially quantitative in 
character. Their solution calls for the application of the 
methods of precision which have been developed in the 
physical sciences. It is no less true that the strictly economic
and business problems falling in the other classes
require the employment of quantitative methods. Qualitative
considerations enter, of course, in the solution of such
problems, helping to determine the questions to be asked 
and the methods to be employed. But facts, measured, 
weighed and compared with other facts, constitute the basis 
of business judgments and the foundation of economic rea- 
soning. Statistical methods provide means of organizing 
and appraising these facts. 

Of the three classes of problems distinguished in the pre- 
ceding section two come within the scope of the present dis- 
cussion. Though the methods of statistics are in part ap- 
plicable to the solution of technical problems of production, 
it is not the purpose of the present work to develop this 
subject. For the solution of problems in the two other 
fields — those connected with the internal organization and 
administration of business units and with the processes of
buying and selling that bring the business man into contact 
with the price system — methods of statistical analysis are 
peculiarly appropriate. 

Statistical Methods and Problems of Internal 
Administration 

The typical business man, in the administration of his 
organization, is called upon to deal with masses of measure- 
ments. He is dealing with tons of coal, cubic feet of gas, 
or kilowatt hours of energy consumed; with tons of pig iron 
or pairs of shoes produced; with machine hours and man 
hours; with wages, costs of production and selling prices 
expressed in dollars and cents. With the increasing size of 
the business unit the data with which the administrator 
must deal become increasingly complicated and numerous, 
and it becomes increasingly difficult to determine their true
significance. Under intuitive or rule-of-thumb methods of 
administration it is impossible effectively to analyze large 
masses of data and to control business units above the 
average in size. It has been abundantly demonstrated that
the law of decreasing returns comes into play in business 
largely because of administrative difficulties. 

Whenever one deals with masses of data the problem is 
one of condensation and analysis — condensation and sim- 
plification in order that it may be possible for limited human 
faculties to handle the data, analysis (and comparison) in 
order that the elements of the problem may be distinguished 
and their significance appreciated. Statistical methods have 
been developed to facilitate the condensation and analysis 
of masses of quantitative data. 

As a typical example of such a problem may be mentioned 
the allocation of costs, an operation which has been called 
cost accounting. The proper analysis of all the factors 
which enter into this problem is only possible through the 
use of statistical methods. Accounting methods, restricted 
to the treatment of pecuniary units, are inadequate for the 
complete analysis of the items of expense. The analysis of 
sales records, again, calls for the condensation of masses of 
data, their representation in simple, understandable form, 
and their interpretation in relation to other business
measurements. The analysis of markets and the study of purchasing
records and commodities require the use of quantitative
methods not restricted in their application to any one class 
of measurements. At every hand in internal administration 
statistical methods may be used to supplement accounting 
methods, to extend the knowledge of the executive, and to 
make more effective the control of business operations. 

Statistical Methods and External Problems

New problems are encountered when the business man 
goes into the market to buy or sell. Continually before him 
are the phenomena of business cycles, and if he is to adapt 
his producing and marketing policies to the swings of the 
cycle he must undertake the analysis of these phenomena, 
employing tools appropriate to the task. Again, the price 
system, the movements of which are of such fundamental
interest to the business man, requires analysis through the 
use of quantitative methods. So complex and numerous are 
the data to be dealt with here that simplification is impera- 
tive. Apart somewhat from the immediate interests of the 
business man, but of dominant importance to the economist, 
are all the problems connected with the economic process of 
distribution, the allocation of income and wealth among 
the agents of production. These, as well as that other great 
economic problem concerned with the question of value or 
price determination, are quantitative problems, to be solved 
through the use of quantitative methods of research. 

Statistical Procedures in Research

What are these methods, and wherein does research em- 
ploying such methods differ from other types of research? 
Scientific inquiry, whatever its particular method may be, 
proceeds through careful observation, logical inference and 
accurate verification. Quantitative methods differ from 
others only in that observation, inference, and verification 
are based upon measurement. Until measurement is possible 
in a science it is unavoidable that its observations and find- 
ings should lack precision, no matter how brilliant the flashes 
of intuition nor how painstaking the labors of its students 
may be. The employment of methods of measurement, 
making possible the analysis of the factors involved in terms 
of precise units, gives to a science some of the advantages 
that sharp-edged tools have over blunt and unreliable
instruments. Mathematics and its offspring, statistics and
accounting, are the powerful instruments which the modern
economist has at his disposal, and of which business, through 
the development of research agencies and methods, is making 
constantly greater use. 

The tools of the statistician are merely certain mathematical
methods, developed for particular types of research.
These types of research were not economic in the original
development of statistical methods, but social, political, and
anthropometric, with one line of development (that relating
to the theory of probabilities) extending back through the 
field of logic to the gaming table. Yet these tools, developed 
for work in restricted spheres, have been found to possess 
much wider applicability, and economics has been one of
the newer fields in which the application of these methods, 
with appropriate alterations and additions, has had fruitful 
results. The economist has found his hand strengthened 
and the precision of his work materially increased by the 
new tools. And business, together with the more abstruse 
science of economics, has profited. 

Reference has been made to the possibility of condensa- 
tion and simplification through the use of statistical pro- 
cedures. Such simplification is of cardinal importance in 
economics and in the other social sciences today. These 
sciences, to be realistic, must be scrupulously faithful to 
fact, yet the masses of facts relating to current social proc- 
esses are, in their magnitude, almost a menace to effective 
analysis. “Already,” writes a reviewer in the Journal of the 
Royal Statistical Society, “economic analysis taxes language 
to its utmost, and it is a question how much longer mere 
verbal exposition will be able to control the swelling floods 
of observable data.” Though one may feel that these floods 
of data fail to provide many of the essential facts about 
social processes today, there is point to the reviewer’s com- 
plaint. In the light of this danger systematic procedures 
in the organization and analysis of data have an importance 
today that they did not have at an earlier time. Statistical 
methods constitute such procedures. By their use we may 
seek to channel and appraise the floods of data, relating to 
business operations and other social processes, that the 
fact-gathering agencies of business and government currently 
release upon us. 
GRAPHIC PRESENTATION

The explanation of methods of condensing, analyzing, and 
interpreting the facts of business and economics must start 
with the discussion of some fundamental considerations 
which are mathematical rather than statistical in character.
In doing so it is deemed advisable, even at the risk of tread- 
ing quite familiar ground, to explain certain mathematical 
conceptions to which constant reference will be made in 
later chapters. 

Statistical analysis is concerned primarily with data based 
upon measurement, expressed either in pecuniary or physical 
units. The methods of coordinate geometry, developed first 
by the philosopher Descartes, greatly facilitate the
manipulation and interpretation of such data. A summary of the
basic principles of coordinate geometry will not be out of 
place. 

Rectangular Coordinates 

If two straight lines intersecting each other at right
angles are drawn in a plane, it is possible to describe the 
location of any point in that plane with reference to the 
point of intersection of the two lines. We will call one of 
the lines (a vertical line) Y'Y, the other line (horizontal) X'X,
and the point of intersection (or origin) O (cf. Fig. 1). If P
be any point in the plane, we may draw the line PM, parallel
to Y'Y and intersecting X'X at M, and the line PN,
parallel to X'X and intersecting Y'Y at N. If we set OM
equal to g units and ON equal to h units, g and h constitute
the coordinates of P, describing its location with reference
to the origin O. Thus, in Fig. 1, g equals 6 and h equals 5.
The distance g along the x-axis is termed the abscissa of
the point P, while the distance h along the y-axis is termed
the ordinate of the point P. (It is a rule of notation always 
to give the abscissa first, followed by the ordinate.) The 
coordinates of any other point in the same plane may be
determined in the same way. Conversely, any two real
numbers determine a point in the plane, if one be taken as
the abscissa and the other as the ordinate.

Fig. 1. Location of a Point with Reference to Rectangular Coordinates

A point may lie either to the right or left or above or
below the origin, O. It is conventional to designate as
positive abscissas laid off to the right of the origin, and as
negative abscissas laid off to the left of the origin, while
ordinates are positive when laid off above the origin and
negative when laid off below the origin. [...]

A simple example will illustrate the utility of this device in
representing business data. The figures presented in the
following table may be employed.

Table 1

Production of Passenger Cars, by Months, During the Year 1937

                    Number of
    Month           passenger cars
                    manufactured

    January            309,637
    February           296,636
    March              403,879
    April              439,980
    May                425,432
    June               411,394
    July               360,403
    August             311,456
    September          118,671
    October            298,662
    November           295,328
    December           244,385

These data may be represented on a coordinate system, months
being laid off along the x-axis and number of cars along the
y-axis, as in the accompanying diagram (Fig. 2). In plotting the
abscissas, December, 1936, is considered as located at the point
of origin. The x-value of the January, 1937, figure is equal to 1,
that of the February figure to 2. The coordinates of the point
representing the number produced in February are thus 2 and
296,636; those of the December point are 12 and 244,385. The
movement of automobile production during the year may
be more easily followed if the points are connected by a
series of straight lines, as is done in the figure.
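
The pairing of months with production figures may be sketched in a few lines of Python. This is only an illustration: the figures are copied from Table 1, the year 1937 is read from the accompanying text, and the names `production`, `points`, and `changes` are introduced here and are not part of the original discussion.

```python
# Monthly passenger-car production, January through December, from Table 1.
production = [309_637, 296_636, 403_879, 439_980, 425_432, 411_394,
              360_403, 311_456, 118_671, 298_662, 295_328, 244_385]

# Abscissas: December of the preceding year is taken as the origin (x = 0),
# so January is x = 1 and December is x = 12.
points = [(x, y) for x, y in enumerate(production, start=1)]

# Change from each month to the next: the rise or fall of each
# straight segment of the broken line connecting the points.
changes = [later - earlier
           for earlier, later in zip(production, production[1:])]

# Lowest and highest points of the broken line, as (abscissa, ordinate).
lowest = min(points, key=lambda p: p[1])
highest = max(points, key=lambda p: p[1])

print("February point:", points[1])
print("Lowest point:  ", lowest)
print("Highest point: ", highest)
```

Each tuple gives the abscissa first and the ordinate second, following the rule of notation stated earlier.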

Independent and Dependent Variables 
In the location of any point by means of coordinates it 
has been pointed out that two values are involved; every 
point ties together and expresses a relation between two 
factors. In the above case these are months and number of 
passenger automobiles produced. With the passage of time 
the volume of automobile production changes, and the 
broken line shows the direction and magnitude of these 
changes. Both time and number of cars produced are vari- 
ables, that is, they are quantities not of constant value but 
characterized by variations in value in the given discussion.
Thus in Fig. 1 the abscissa has a fixed value of 6, while the
ordinate has a fixed value of 5, but in Fig. 2 both abscissa 
and ordinate have varying values, the one varying from 1 
to 12, the other from 118,671 to 439,980. The symbols x 
and y are, by convention, used to designate such variable 
quantities as these, the former in all cases representing the 
variable plotted along the horizontal axis, the latter
representing the variable plotted along the vertical axis.¹

In Fig. 2, which depicts the changes taking place in 
automobile production with the passage of time, it will be 
noted that the latter variable changes by an arbitrary unit, 
one month. Having made an independent change in the 
time factor we then determine the change in production taking
place during the period thus arbitrarily chopped out. The 
variable which increases or decreases by increments arbi- 
trarily determined is called the independent variable, and is 
generally plotted on the x-axis. The other variable is termed
the dependent variable, and is plotted on the y-axis. This 
dependence may be real, in the sense that the values of the 
second variable are definitely determined by the values of 
the independent variable, or it may be purely a conven- 
tional dependence of the type described. Time, it should 
be noted, is always plotted as independent, when it consti- 
tutes one of the variables. 

Functional Relationship 

When the relationship between two variables is one of 
complete dependence, so that the value of y is uniquely
determined by a given value of x, y is said to be a function
of x. The general expression for such a relationship is
y = f(x). Thus the speed at which a body is falling at a
given moment is a function of the time it has been falling,
the pressure of a given volume of gas is a function of its
temperature, the increase of a given principal sum of money
at a fixed rate of interest is a function of time. If the values
of the independent variable be laid off on the x-axis of a
rectilinear chart and the corresponding values of the function
(i.e., the dependent variable) be laid off on the y-axis,
a graphic representation of the function will be secured, in
the form of a curve.² This concept of functional relationship
is a very important one in statistical work. Some of the
simpler functions may be briefly discussed.

¹ It should be noted that letters at the end of the alphabet are used as
symbols for variables, while letters at the beginning of the alphabet are used as
symbols for constants, i.e., quantities the values of which do not change in the
given discussion.
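
The first of the examples named above, the speed of a falling body, may serve as a small sketch of a functional relationship in Python. The relation used, speed = g × t for a body falling freely from rest, neglects air resistance, and the value of g and the name `speed` are assumptions made for this illustration only.

```python
G = 9.81  # assumed acceleration of gravity, in metres per second per second


def speed(t):
    """Speed (m/s) of a body that has fallen freely from rest for t seconds."""
    return G * t


# Values of the independent variable (time) and of the function (speed).
times = [0, 1, 2, 3]
speeds = [speed(t) for t in times]

for t, v in zip(times, speeds):
    print(t, round(v, 2))
```

Each value of the independent variable yields exactly one value of the function, which is what is meant by saying that y is uniquely determined by x.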

THE STRAIGHT LINE 

If two variables are so related that their values are always 
the same, their relationship is obviously of the form y = x. 
As a very simple example, the relation between the age of a 
tree and the number of rings in its trunk may be cited. 
A tree 6 years old will have 6 rings, for 20 years there will 
be 20 rings, and so on. This relationship may be represented 
on a coordinate chart, several sample values of x and y 
being taken. When these points are plotted and a line drawn 
through them, we secure a straight line passing through the 
origin and (assuming the two scales to be equal) bisecting 
the right angle XOY (cf. Fig. 3). 

- Similarly, any equation of the first degi-ee (i.e., not in- 
volving xy, or powers of x or 2/ above the first) may be rep- 
resented by a straight line. The generalized equation can 
be reduced to the form y = a hx, where a is a constant 
representing the distance from the origin to the point of 
intersection of the given line and the y-axis, and h is a con- 
stant representing the slope of the given line (that is, the 
tangent of the angle which the line makes with the hori- 
zontal). The constant term a is called the y-intercept. It is 
clear from the generalized equation of the straight line that 

^ The general term “curve’' is used to designate any line, straight or curved, 
when located with reference to a coordinate system. 
when x has a value of zero, y will be equal to this constant
term. In the example given above (Fig. 3) a is equal to 0,
and b to 1. The location of a given line depends upon the
signs of a and b as well as upon their magnitudes. The
practical problem involved in the determination of any straight


Fig. 3. — Graph of the Equation y = x

line is that of finding the values of a and b from the data,
a problem which will appear in various forms in the discus- 
sion of statistical methods. 

These points may be illustrated by the plotting of a
simple equation of the first degree. Thus, to construct
the graph of the function y = 2 + 3x, various values of
x are assumed, and corresponding values of y are determined.
These may be arranged in the form of a table:
  x         y
         (2 + 3x)
 -4       -10
 -2        -4
  0         2
  2         8
  4        14


Plotting these values and connecting the plotted points, 
the graph illustrated in Fig. 4 is secured. It will be noted 



Fig. 4. — Graph of the Equation y = 2 + 3x


that since this function is linear (that is, the graph takes 
the form of a straight line) any two of the points would have 
been sufficient to locate the line. The y-intercept is equal
to the constant term 2, and the tangent of the angle which 
the given line makes with the horizontal (the slope of the line) 
is equal to 3, the coefficient of x. That this curve repre- 
sents the equation is proved by the fact that the equation is 
satisfied by the coordinates of every point on the curve, and 
that every pair of values satisfying the equation is represented 
by a point on the curve. It is characteristic of a linear 
relationship that if one variable be increased by a constant 
amount, the corresponding increment of the other variable 
will be constant. In the above case, as x grows by constant
increments of 2, for example, the constant increment of the
y-variable is 6. Series which increase in this way by constant
increments are termed arithmetic series.
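
The constant-increment property of the linear function may be checked by direct computation. The following is a minimal sketch in Python that rebuilds the table for y = 2 + 3x; the name linear_table is introduced here purely for illustration.

```python
def linear_table(a, b, xs):
    """Return (x, y) pairs for the straight line y = a + b*x."""
    return [(x, a + b * x) for x in xs]


xs = [-4, -2, 0, 2, 4]            # x grows by constant increments of 2
table = linear_table(2, 3, xs)    # y = 2 + 3x

# Differences between successive y values; constant for a linear function.
increments = [table[i + 1][1] - table[i][1] for i in range(len(table) - 1)]

for x, y in table:
    print(f"{x:>3} {y:>4}")
print("increments of y:", increments)
```

Each increment of y equals the slope multiplied by the increment of x, here 3 × 2 = 6.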

Many examples of linear relationship between variables 
are found in the physical sciences. An example from the 
economic world is found in the growth of money at simple 
interest, that is, interest which is not compounded. If we 
let r represent the rate of simple interest, x the number of
years, and y the sum to which one dollar will amount at
the end of x years, the equation of relationship is of the
form

        y = 1 + rx.

Since in a given case r will be constant, this is of the simple 
linear type. In statistical work precise relationships of 
this type rarely if ever occur, but approximations to the 
straight line relationship are found constantly. 

NON-LINEAR RELATIONSHIP

Non-linear functions are of many types, of which only 
a few of the more common will be discussed here. The 
student should be familiar with the general characteristics 
of the chief non-periodic curves, of which the parabolic 
and hyperbolic types, on the one hand, and the exponential 
type on the other, are the most important. The potential 
series is mentioned as a more general form of rather wide
utility. Of periodic functions the sine curve is briefly de- 
scribed, as a fundamental form. 

Functional relationships of the parabolic or hyperbolic 
form are quite common in the physical sciences, and such 
curves are found to fit certain classes of economic data. 
The general equation, when there is no constant term, is
of the form y = ax^b. The curve is parabolic when the exponent
b is positive, and hyperbolic when b is negative.
The two following examples will serve to illustrate these
types:

Problem: To construct the graph of the function y = x^2.


  x      y
       (x^2)
 -5     25
 -4     16
 -3      9
 -2      4
 -1      1
  0      0
  1      1
  2      4
  3      9
  4     16
  5     25

The graph is shown in Fig. 5. 

Problem: To construct the graph of the function y = x^(-1),
for positive values of x.

   x        y
         (x^(-1))
  1/3       3
  1/2       2
   1        1
   2       1/2
   3       1/3
   4       1/4
   5       1/5

Fig. 5. — Parabola: Graph of the Equation y = x^2

The graph of this function, an equilateral hyperbola, is 
shown in Fig. 6. It should be noted that this equation
may also be written y = 1/x or xy = 1.

It is characteristic of relationships of this type that as x
increases in geometric progression, y also changes in geometric
progression. Thus, in the example of the parabola given above
(y = x^2), if we select the x values which form a geometric
series,^ the corresponding y values form a similar series:

  x    1    2    4    8    16     32
  y    1    4   16   64   256  1,024
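
The same property can be verified for both the parabola and the hyperbola. The sketch below, in Python, computes the constant multiplier between successive terms; the helper name ratios is an assumption made for this illustration.

```python
def ratios(values):
    """Multiplier that carries each term of a series into the next."""
    return [values[i + 1] / values[i] for i in range(len(values) - 1)]


xs = [1, 2, 4, 8, 16, 32]              # geometric series, multiplier 2
parabola = [x ** 2 for x in xs]        # y = x^2
hyperbola = [x ** -1 for x in xs]      # y = x^(-1)

print("x multipliers:        ", ratios(xs))
print("y = x^2 multipliers:  ", ratios(parabola))
print("y = x^(-1) multipliers:", ratios(hyperbola))
```

For y = x^b the multiplier of the y series is the multiplier of the x series raised to the power b: 4 for the parabola, and 1/2 for the hyperbola, whose values therefore decline in geometric progression.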

Another class of functions is of the form represented by
the equation y = ab^x. In equations of this type one of the
variable quantities occurs as an exponent; graphs representing
such equations are called exponential curves. The
example which follows illustrates the type.

^ A geometric series is one each term of which is derived from the preceding
term by the application of a constant multiplier.

Fig. 6. — Equilateral Hyperbola: Graph of the Equation y = x^(-1)
(for positive values of x)

Problem: To construct the graph of the function y = 2^x,
for positive values of x.

  x      y
       (2^x)
  0      1
  1      2
  2      4
  3      8
  4     16
  5     32
  6     64

This graph is shown in Fig. 7.

Fig. 7. — Exponential Curve: Graph of the Equation y = 2^x (for positive
values of x)

It has been noted that the relationship between two
variables which increase by constant increments (constituting
arithmetic series) may be represented by a straight
line, and that the relationship between variables increasing
in geometric progression may be represented by either a
parabola or a hyperbola. The exponential curve constitutes
a hybrid type. It describes a relation in which one variable
increases in arithmetic progression while the other increases
in geometric progression. The figures given above illustrate
this relationship.
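
The hybrid character of the exponential curve may be seen by computing the differences of the x values and the multipliers of the y values for y = 2^x. This is a minimal sketch in Python; the variable names are chosen here for illustration only.

```python
xs = list(range(0, 7))        # 0, 1, ..., 6: an arithmetic series
ys = [2 ** x for x in xs]     # y = 2^x

# Constant differences mark an arithmetic series,
# constant multipliers a geometric series.
x_steps = [xs[i + 1] - xs[i] for i in range(len(xs) - 1)]
y_multipliers = [ys[i + 1] / ys[i] for i in range(len(ys) - 1)]

print("y values:     ", ys)
print("x differences:", x_steps)
print("y multipliers:", y_multipliers)
```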
Curves based upon relationships of the following type have
been employed extensively in statistical inquiries:

        y = a + bx + cx^2 + dx^3 + ....

The term potential series has been applied to equations of 
this type. Though such curves do not constitute parabolas 
of the strict conic section type, a curve based upon such 
an equation carried to the second power of x is termed a 
second degree parabola, to the third power of x, a third 
degree parabola, etc. No uniform and simple type is secured 
from this series. It is treated in more detail at a later point. 

Periodic functions constitute another distinct type, a class
represented notably by electrical and meteorological rela- 
tions, though not confined to these fields. The character- 
istic feature of such relationships is that values of the de- 
pendent variable repeat themselves at constant intervals 
of the independent variable. The sine curve, the basic type 
of this class, is illustrated in the following example. 

Problem: To construct the graph of the function y = sin x.

         x                  y
 (angle in degrees)      (sin x)
         0°                .000
        30°                .500
        60°                .866
        90°               1.000
       120°                .866
       150°                .500
       180°                .000
       210°              - .500
       240°              - .866
       270°              -1.000
       300°              - .866
       330°              - .500
       360°                .000
       390°                .500
       etc.


The graph is shown in Fig. 8. 
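
The table of sines, and the repetition of values at intervals of 360°, can be reproduced with a few lines of Python. This is only a sketch; the function name sine_table and its default arguments are assumptions made for the illustration.

```python
import math


def sine_table(start=0, stop=390, step=30):
    """Return (degrees, sine) pairs, with the sine rounded to three places."""
    table = []
    for degrees in range(start, stop + 1, step):
        # Adding 0.0 turns a rounded "-0.0" into a plain 0.0.
        value = round(math.sin(math.radians(degrees)), 3) + 0.0
        table.append((degrees, value))
    return table


table = sine_table()
for degrees, value in table:
    print(f"{degrees:>4}°  {value:>7.3f}")
```

The entries for 360° and 390° repeat those for 0° and 30°, which is the defining feature of a periodic function.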
The full importance in statistical work of securing a
mathematical expression for the relation between two variables
cannot be demonstrated until the subject has been
further developed. One fundamental object is the determination
of physical or economic laws underlying observed
phenomena. Another more practical object is the securing
of a formula by means of which values of one variable may
be approximated from given values of the other. Examples
throughout the book will serve to illustrate how these objects
are attained.^

Logarithms 

Logarithms, which play such an important part in gen- 
eral mathematical operations, are of equal importance in 
the manipulation of the raw materials of statistics. The 
nature of logarithms, and the methods by which they are 
employed to facilitate arithmetic processes, may be briefly 

^ A fuller discussion of different curve types is presented below, in the section 
dealing with the analysis of time series. 
reviewed. This discussion is concerned only with the common
system of logarithms of which the base is 10. 

Any positive number may be expressed as a power of 10. 
Thus 

         1,000 = 10 × 10 × 10 = 10^3

        10,000 = 10 × 10 × 10 × 10 = 10^4

In each case the exponent of 10 (the small number written 
above and to the right) indicates the number of times the 
figure 10 is repeated as a factor. For the integral powers 
of 10 the exponent is a whole number, but for the other 
numbers the exponent will contain a fractional value. Thus 
100 is equal to 10 raised to the power 2, or 10^2; 110 is equal
to 10 raised to the power 2.04139, or 10^2.04139.

The exponent of 10, or the index of the power to which 10 
must be raised to equal a certain number, is called the loga- 
rithm of that number. The logarithm of 100 is 2, the logarithm 
of 110 is 2.04139, the logarithm of 998 is 2.99913. These 
figures all have reference to the base 10, though a system 
of logarithms might be developed on any base. In general, if 

        a = b^c
then
        log_b a = c

which may be read “the logarithm of a to the base b is
equal to c.” The relation between the given number, the
base and the logarithm, when the common system of loga- 
rithms is employed, may be easily remembered if the follow- 
ing relations are kept in mind : 

        100 = 10^2
        log_10 100 = 2.

The logarithm of any number has two parts, the integral 
and the decimal. The whole number is called the charac- 
teristic, and the decimal portion is termed the mantissa. 
The former is determined in a given case by inspection, 
while the mantissa may be obtained from logarithmic tables. 
The characteristic varies with the location of the decimal
point, while the mantissa remains the same for any given 
combination of numbers. This fact is illustrated by the 
following figures: 

log of 8,450 = 3.92686 

log of 845 = 2.92686 

log of 84.5 = 1.92686 

log of 8.45 = .92686 

log of .845 = 9.92686 - 10 

log of .0845 = 8.92686 - 10. 

In finding the natural number to which a given logarithm 
corresponds (such natural numbers are termed anti-loga- 
rithms), the mantissa determines the sequence of figures, 
while the whole number, or characteristic, determines the 
location of the decimal point. For example, in seeking the 
anti-logarithm of 2.17609 it is found that the decimal
.17609 follows the natural number 1,500, in a table of
logarithms. Since the characteristic is 2, the natural num- 
ber desired must lie between 100 and 1,000, and must there- 
fore be 150. 
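
The division of a logarithm into characteristic and mantissa, and the recovery of an anti-logarithm, may be illustrated computationally. The following Python sketch uses the standard math module in place of printed tables; the function name characteristic_and_mantissa is an assumption introduced here. It reports a negative characteristic directly (for example -1), where the text writes the equivalent form 9.92686 - 10.

```python
import math


def characteristic_and_mantissa(number):
    """Split the common logarithm of a positive number into
    its whole-number part and its (non-negative) decimal part."""
    log = math.log10(number)
    characteristic = math.floor(log)
    mantissa = log - characteristic
    return characteristic, round(mantissa, 5)


for n in [8450, 845, 84.5, 8.45, 0.845, 0.0845]:
    print(n, characteristic_and_mantissa(n))

# Anti-logarithm: raise 10 to the power given by the logarithm.
antilog = 10 ** 2.17609
print(round(antilog))
```

The mantissa is the same for every position of the decimal point, while the characteristic falls by one each time the number is divided by 10.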

A brief study of the following figures, showing the pro- 
gression of numbers corresponding to certain powers of 10, 
will help to fix in mind the relations between the multiples 
of 10 and their logarithms, and will enable the characteristic 
of a desired logarithm to be readily determined. 

 .0001    .001    .01     .1      1      10     100    1,000   10,000

 10^-4   10^-3   10^-2   10^-1   10^0   10^1   10^2    10^3     10^4

The exponents of 10 in the lower row are the logarithms 
of the numbers in the upper row. 

It should be noted that the logarithms of all numbers
from 0 to 1 are negative. Thus the logarithm of .845 is
-1 + .92686 (that is, -.07314); this is written 9.92686 - 10.
In covering the range of all positive natural numbers from zero
to infinity, logarithms traverse all positive and negative values.
A negative natural number, therefore, can have neither a
positive nor a negative logarithm.

The advantage of thus expressing numbers as powers of
10 lies in the fact that the ordinary arithmetic operations of
multiplication, division, raising to powers and extracting
roots are greatly facilitated by this procedure.

To multiply numbers, add their logarithms. The sum
of the logarithms of the factors is the logarithm of their
product. In general terms:

        a^b × a^c = a^(b + c).

Specifically

        10^2 × 10^3 = (10 × 10) × (10 × 10 × 10) = 10^5 = 100,000
        100 × 1,000 = 100,000.

To divide one number by another, subtract the loga- 
rithm of the latter from the logarithm of the former. The 
remainder is the logarithm of the desired quotient. 

In general terms:

        a^b ÷ a^c = a^(b - c).

Specifically

        10^5 ÷ 10^2 = (10 × 10 × 10 × 10 × 10) ÷ (10 × 10) = 10^3 = 1,000
        100,000 ÷ 100 = 1,000.


To raise a given number to any power, multiply the 
logarithm of the number by the index of the power. The 
product is the logarithm of the desired power. 

In general terms:

        (a^b)^c = a^(bc).

Specifically

        (10^3)^2 = (10 × 10 × 10) × (10 × 10 × 10) = 10^6 = 1,000,000
        1,000^2 = 1,000,000.

To extract any root of a given number, divide the logarithm
of the number by the index of the root. The quotient
is the logarithm of the desired root.
In general terms:

        the c-th root of a^b = a^(b/c).

Specifically

        the cube root of 10^6 = 10^(6/3) = 10^2 = 100
        the cube root of 1,000,000 = 100.

In summary: 

        log (a × b) = log a + log b
        log (a ÷ b) = log a - log b
        log a^b = b × log a
        log (b-th root of a) = log a ÷ b.

These characteristic advantages of logarithms have been 
made use of in the construction of the slide rule, an instru- 
ment for reducing routine toil which should be familiar to 
all students of statistics. 
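
The four rules may be checked numerically, the computer taking the place of the table of logarithms. The following is a minimal Python sketch; the two numbers 845 and 150 are arbitrary values chosen for the illustration.

```python
import math

log = math.log10
a, b = 845.0, 150.0   # arbitrary illustrative values

# Each result is obtained by operating on logarithms and then
# taking the anti-logarithm (10 raised to the resulting power).
product = 10 ** (log(a) + log(b))     # multiplication: add logarithms
quotient = 10 ** (log(a) - log(b))    # division: subtract logarithms
cube = 10 ** (3 * log(a))             # power: multiply by the index
cube_root = 10 ** (log(a) / 3)        # root: divide by the index

print(product, a * b)
print(quotient, a / b)
print(cube, a ** 3)
print(cube_root, a ** (1 / 3))
```

The paired figures agree except for rounding in the last decimal places, which corresponds to the limited number of places carried in a printed table.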

LOGARITHMIC EQUATIONS 

The graphic representation of data by means of a system 
of rectangular coordinates has been described above and 
some of the advantages of this method have been outlined. 
For many purposes it is desirable to plot logarithms rather 
than the natural numbers themselves. This may result in 
bringing out significant relations more distinctly, or it may 
serve greatly to simplify and facilitate the manipulation of 
data. In particular, when it is possible through the use of 
logarithms to reduce a complex curve to the straight line 
form, a distinct gain has been made in the direction of 
simplicity of treatment and interpretation. 

A linear equation, it will be recalled, is of the general 
form y = a + bx, where a and b are constants which measure,
respectively, the y-intercept of the given line and the
slope. The simplification of equations through the use of
logarithms involves in all cases the substitution of log x or 
log y, or both, for the x or y variables, thereby reducing an 
equation of a higher order to a simpler form. 
This process may be illustrated with reference to the
equation y = x^2. When plotted on rectangular coordinates
this equation gives a curve of the parabolic type (cf. Fig. 5).

Fig. 9. — Graph of the Equation log y = 2 log x (Logarithmic form of
the equation y = x^2)

Reduced to logarithmic form this becomes log y = 2 log x. 
This equation, in which the variables are log y and log x,
is linear in form. It is plotted in Fig. 9, for positive values
of log x. To indicate the relations involved, natural numbers
corresponding to the logarithms are given on scales to the
right and at the top of the figure. The natural numbers
appearing on the scales constitute geometric series, while 
their logarithms form arithmetic series. Equal vertical 
distances on the chart, it will be noted, represent equal 
absolute increments on the scale of logarithms and equal 
percentage increments on the scale of natural numbers. 

The equation y = 6x^3 can be reduced in the same way
to log y = log 6 + 3 log x, a linear form. Similarly, all
equations of the type y = ax^b, that is to say, all simple
parabolas and hyperbolas, can be reduced to the straight
line form log y = log a + b log x. Graphically this means
plotting the logarithms of the y’s against the logarithms of
the x’s.

A different problem is presented by an equation of the
type y = ab^x, the graph of which is termed an exponential
curve. Expressed in logarithmic form, we have log y =
log a + x log b. This is also of the linear type, the two constants
being log a and log b, while the variables are x
and log y. If we plot the natural x’s and the logs of
the y’s, with this type of equation, a straight line will be
secured. A curve of this type is discussed and illustrated
below.
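
Both reductions to the straight line form can be verified by computing the slope between successive plotted points: if the slopes are all equal, the points lie on a straight line. The Python sketch below does this for y = 6x^3 (logarithms of both variables) and for y = 2^x (logarithms of y only); the helper name successive_slopes is an assumption made for the illustration.

```python
import math


def successive_slopes(points):
    """Slopes between neighbouring points; equal slopes mean a straight line."""
    return [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    ]


xs = [1, 2, 3, 4, 5]

# Power type, y = 6x^3: plot log y against log x.
power_points = [(math.log10(x), math.log10(6 * x ** 3)) for x in xs]

# Exponential type, y = 2^x: plot log y against the natural x.
exponential_points = [(x, math.log10(2 ** x)) for x in xs]

power_slopes = successive_slopes(power_points)
exponential_slopes = successive_slopes(exponential_points)

print("log-log slopes: ", power_slopes)
print("semi-log slopes:", exponential_slopes)
```

In the first case the common slope is the exponent b = 3; in the second it is log b = log 2, or about .30103.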

LOGARITHMIC AND SEMI-LOGARITHMIC CHARTS 

There are certain disadvantages to the plotting of loga- 
rithms, however. If a considerable number of points are 
being plotted the task of looking up the logarithms may be 
tedious, and, in addition, the original values, in which 
chief interest lies, will not appear on the chart. These 
difficulties may be avoided by constructing charts with the 
scales laid off logarithmically, but with the natural numbers 
instead of the logarithms appearing on the scales. This is 
an arrangement identical with that employed in the con- 
struction of slide rules. Thus, although the natural numbers 
are given on the scales, distances are proportional to the 
logarithms of the numbers thereon plotted. In Fig. 10 
such a chart is presented, showing the graph of the equation
y = x^2.

Fig. 10. — Graph of the Equation y = x^2 (Plotted on paper with
logarithmic scales)

A variation of this type of chart which is of great importance
in statistical work is one which is scaled arithmetically
on the horizontal axis and logarithmically on the
vertical axis. This is equivalent, of course, to plotting the
x’s on the natural scale and plotting the logarithms of the
y’s. As was pointed out above, such a combination of
scales reduces a curve of the exponential type to a straight
line. Plotting paper of this semi-logarithmic or “ratio”
type may be constructed with the aid of a slide rule or
of logarithms, or may be purchased ready made. It is of
particular value in charting economic statistics, because of
the fact that time is usually one of the variables in such
cases, and it is desirable to plot this variable on the natural
scale.

Fig. 11. — The Compound Interest Law: Growth of $10.00 at Compound
Interest at 6 per cent for 100 Years (Plotted on arithmetic scale)


As an example of this type of curve the compound interest
law may be used. If r be taken to represent the rate
of interest, x the number of years, p the principal, and
y the sum to which the principal amounts at the end of x
years (interest being compounded annually), an equation is
secured of the form

        y = p(1 + r)^x

Expressed logarithmically this becomes

        log y = log p + x log (1 + r),

the equation to a straight line.

Fig. 12. — The Compound Interest Law: Growth of $10.00 at Compound
Interest at 6 per cent for 100 Years (Plotted on semi-logarithmic
or ratio scale)

In Fig. 11 a curve representing the growth of ten dollars
at compound interest at 6 per cent is plotted on the natural
scale. This is the graph of the exponential equation

        y = 10(1 + .06)^x

y representing the total amount of principal and interest
at the end of x years. Figure 12 shows the same data plotted
on semi-logarithmic paper, the exponential curve being reduced
to a straight line.

The use of semi-logarithmic paper is not confined to 
cases in which an exponential curve is straightened out,
for the significance of many types of data is most effectively 
brought out when charts of this type are used. These 
advantages are more fully explained below. 
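
The straightening of the compound interest curve may be checked arithmetically. The Python sketch below computes y = 10(1 + .06)^x at intervals of 25 years and shows that, while the amounts themselves grow by ever larger sums, their logarithms grow by a constant increment. The function name compound_amount and the choice of 25-year intervals are assumptions made for the illustration.

```python
import math


def compound_amount(principal, rate, years):
    """Amount of principal and interest, compounded annually: p(1 + r)^x."""
    return principal * (1 + rate) ** years


years = [0, 25, 50, 75, 100]
amounts = [compound_amount(10, 0.06, x) for x in years]
logs = [math.log10(y) for y in amounts]

# Increments on the natural scale and on the scale of logarithms.
amount_steps = [amounts[i + 1] - amounts[i] for i in range(len(amounts) - 1)]
log_steps = [logs[i + 1] - logs[i] for i in range(len(logs) - 1)]

for x, y, ly in zip(years, amounts, logs):
    print(f"{x:>3} years  ${y:>10,.2f}  log y = {ly:.5f}")
print("increments of y:    ", [round(s, 2) for s in amount_steps])
print("increments of log y:", [round(s, 5) for s in log_steps])
```

Each increment of log y equals 25 × log (1.06), the slope of the straight line multiplied by the interval of x.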

The Construction of Charts 

When the results of observations or statistical investi- 
gations have been secured in quantitative form, one of the 
first steps toward analysis and interpretation of the data 
is that of presenting these results graphically. Not only is 
such procedure of scientific value in paving the way for 
further investigation of relationships, but it serves an im- 
mediate practical purpose in visualizing the results. A visual 
stimulus opens up a far more direct path to our understanding
and imagination than that afforded by the more recently
developed processes of reasoning. The interpretation of a 
column of raw figures may be a difficult task; the same data 
in graphic form may tell a simple and easily understood story. 
For these reasons graphic methods of presentation have come 
to play a highly important part in the everyday activities 
of business, as well as in the laboratory and drafting room. 

It is beyond the scope of this book to present any detailed 
account of the multiplicity of graphs employed by engineers 
and statisticians today. Certain of the more important 
principles of graphic presentation may be briefly explained, 
however, and some of the chief types of graphs which are 
in daily use may be illustrated. Other examples appear in 
later chapters of this book.

FACTORS GOVERNING THE SELECTION OF A CHART

The selection of the type of chart to be employed in a 
given case will depend upon two general considerations. 
The first of these relates to the character of the material
to be plotted. While the data of a given problem may 
frequently be presented graphically in several different 
forms, there is generally one type of chart best adapted 
to that material. It may be true, also, that certain types 
would be quite inappropriate to the data in question. The 
selection of a type of chart to employ, therefore, must 
be made with the characteristics of the data clearly in 
mind. 

Perhaps more important is the purpose which the given 
chart is designed to serve. Each of the many types 
of charts in common use is appropriate to certain speci- 
fic purposes. It will bring out certain characteristics of 
the data or will emphasize certain relationships. There 
is no chart which is sovereign for all purposes. Until the 
purpose is clearly defined the best chart form cannot 
be selected. The following descriptions of a few stand- 
ard types will facilitate the selection of an appropriate 
form. 

CHARTS ADAPTED TO THE PLOTTING OF TIME SERIES

In the graphic presentation of a time series, primary 
interest attaches to the chronological variations in the values 
of the data, to the general trend and to the fluctuations 
about the trend. If the purpose is to emphasize the absolute 
variations, the differences in absolute units between the 
values of the series at different times, a simple chart of the 
type illustrated in Fig. 13 will serve the purpose. This
chart depicts annual wheat flour exports from the United 
States during the period 1913-1936. Both scales are arith- 
metic. Points representing the various annual values are 
shown and, to facilitate interpretation, these points are 
connected by a series of straight lines. The chart tells a 
simple story of year-to-year fluctuations, with a sharp advance
at the end of the World War, a decline as the post-war
emergency passed, several years of moderate growth, and a
severe decline as the world depression deepened in the early
thirties. With respect to general make-up, the following 
points should be noted: 

1. The title constitutes a clear description of the material plotted
   and indicates the period covered.

2. The vertical scale begins at the zero line, enabling a true
   impression to be gained of the magnitude of the fluctuations.

3. The zero line and the line joining the plotted points are ruled
   more heavily than the coordinate lines.

4. Figures for the scales are placed at the left and at the bottom
   of the chart. The vertical scale may be repeated at the
   right to facilitate reading. All figures are so placed that
   they may be read from the base as bottom or from the
   right hand edge of the chart as bottom.

5. The y-values of the plotted points are given at the top of the
   chart. This practice is helpful, though not necessary, as
   the values may be presented in a separate table.

ADVANTAGES OF THE RATIO CHART 

If relative rather than absolute variations are of chief
concern, the chart employed should be of the semi-logarithmic
type, scaled logarithmically on the y-axis and arithmetically
on the x-axis. In such a chart equal percentage
variations are represented by equal vertical distances, as
opposed to the ordinary arithmetic type in which equal
absolute variations are represented by equal vertical distances.
The argument for the use of the semi-logarithmic
or ratio chart for the representation of time series is that,
in general, the significance of a given change depends upon
the magnitude of the base from which the change is measured.
That is, an increase of 100 on a base of 100 is as
significant as an increase of 10,000 on a base of 10,000. In
each case there is an increase of 100 per cent. The absolute
increase in the second case is 100 times that in the first case,
and the two changes would show in this proportion on the
arithmetic chart. They would show as of equal importance
on the semi-logarithmic chart.
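
The two cases just described may be compared by computing the vertical distance each change would occupy on the two kinds of chart. The following Python sketch measures distance on the arithmetic chart as a difference of values and on the ratio chart as a difference of logarithms; the function names are assumptions introduced for the illustration.

```python
import math


def arithmetic_distance(start, end):
    """Vertical distance on an arithmetic scale (in units of the data)."""
    return end - start


def ratio_distance(start, end):
    """Vertical distance on a logarithmic scale (in units of log10)."""
    return math.log10(end) - math.log10(start)


small = (100, 200)          # an increase of 100 on a base of 100
large = (10_000, 20_000)    # an increase of 10,000 on a base of 10,000

print("arithmetic:", arithmetic_distance(*small), arithmetic_distance(*large))
print("ratio:     ", ratio_distance(*small), ratio_distance(*large))
```

On the arithmetic scale the second distance is 100 times the first; on the ratio scale both are equal to log 2, since each change is a doubling.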

Such a chart is presented in Fig. 14, which shows the 
course of steel production in the United States from 1896 
to 1936. The absolute magnitudes are plotted, but the 
vertical scale is so constructed as to represent variations 
from year to year in proportion to their relative magnitude. 



Fig. 15. — Exports of the United States, 1920-1936, Showing Total Exports
and Exports to Selected Areas (Monthly averages for the years
named are plotted on an arithmetic scale)


Certain distinctive advantages of the ratio or logarithmic 
ruling are brought out by a comparison of Fig. 15 and 
Fig. 16. The data presented graphically in these two charts 
are shown in Table 2 : 
Table 2

Exports of the United States, 1920-1936
(Monthly averages, in thousands of dollars)

          To         To         To      To United   Total to     Grand
Year    France    Germany     Italy      Kingdom     Europe      total

1920   $56,349    $25,953    $30,980    $161,319    $372,174    $685,668
1921    18,745     31,027     17,955      78,510     196,992     373,753
1922    22,247     26,343     12,575      71,319     173,613     319,315
1923    22,678     26,403     13,961      73,527     174,451     347,291
1924    23,472     36,702     15,595      81,912     203,775     382,582
1925    23,358     39,195     17,096      86,155     216,979     409,154
1926    22,000     30,347     13,117      81,051     192,512     400,722
1927    19,066     40,140     10,971      70,005     192,576     405,448
1928    20,058     38,938     13,510      70,613     197,912     427,363
1929    22,133     34,204     12,831      70,667     195,070     435,083
1930    18,663     23,189      8,369      56,509     153,198     320,265
1931    10,152     13,838      4,568      37,923      95,040     202,024
1932     9,297     11,139      4,095      24,027      65,358     134,251
1933    10,143     11,669      5,103      25,978      70,815     139,583
1934     9,642      9,062      5,381      31,896      79,150     177,733
1935     9,751      7,665      6,035      36,117      85,770     190,240
1936    10,795      8,382      4,900      36,662      86,694     204,457


(Data compiled by Bureau of Foreign and Domestic Commerce, U. S. 
Department of Commerce.) 


If the six series are to be presented on a single chart, 
scaled arithmetically, a scale must be selected which will 
include the largest item recorded, $685,668,000. Such a 
scale reduces the relative importance of the smaller magni- 
tudes. From Fig. 15 it appears that during the period cov- 
ered by the chart very large fluctuations occurred in total 
exports, substantial but somewhat smaller movements oc- 
curred in exports to Europe, and that exports to the four 
individual countries suffered much less severe fluctuations. 
Such a picture is quite misleading. The true state of affairs 
is reflected in Fig. 16, in which the same data are plotted 
on paper with a semi-logarithmic ruling. Fluctuations in 
exports to the individual countries are here seen to have 
been relatively greater than the movements of total exports. 
For the purpose of comparing series which differ materially
with respect to the magnitude of the individual items, the
arithmetic ruling is quite useless, giving a thoroughly distorted
picture of the true relations. The ratio ruling permits
a legitimate comparison.

The scales printed below Fig. 16 emphasize certain very 
useful features of the logarithmic ruling. The scale of in- 
crease may be used to measure with a fair degree of accu- 
racy the increase in a given series between any two dates. 
A given vertical distance on the chart, it will be recalled, 
represents a constant percentage increase at all points on 
the chart. Thus the distance from 1 to 10, along the vertical 
scale, is the same as the distance from 100 to 1,000. Any 
vertical distance may be measured, and the percentage of 
increase which it represents may be determined by laying 
off the given distance along the scale of increase, which is 
always read from the bottom up. For example, to determine 
the degree of increase in total exports from 1932 to 1935, 
we measure the vertical distance between the points plotted 
for these two years. Laying off this distance along the scale, 
it is found to represent about a 40 per cent increase. 

The scale of decrease is used in a similar fashion. The 
vertical distance between any two points is measured, and 
the percentage decrease which it represents is determined 
by laying off the given distance on the scale from the top 
downward. The arrows indicate the direction in which the 
various scales are to be read. 

By means of the scale of comparison the percentage relation 
of one series to another at any time may be determined. 
For example, we may wish to know the percentage relation 
between exports to Europe and total exports in 1935. The 
vertical distance between the two plotted points is measured, 
and laid off on the scale of comparison, reading from the top 
downward. It is found to be approximately 45 per cent. 
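
The two readings taken from the scales may be checked against the figures of Table 2. The Python sketch below computes the percentage increase in total exports from 1932 to 1935 and the percentage relation of exports to Europe to total exports in 1935; the function names are assumptions made for the illustration, and the data are the monthly averages (in thousands of dollars) given in the table.

```python
# Monthly averages from Table 2, in thousands of dollars.
total_exports = {1932: 134_251, 1935: 190_240}
europe_exports = {1935: 85_770}


def percent_increase(earlier, later):
    """Percentage by which the later value exceeds the earlier."""
    return (later / earlier - 1) * 100


def percent_of(part, whole):
    """Percentage relation of one magnitude to another."""
    return part / whole * 100


increase = percent_increase(total_exports[1932], total_exports[1935])
relation = percent_of(europe_exports[1935], total_exports[1935])

print(f"increase in total exports, 1932 to 1935: {increase:.1f} per cent")
print(f"exports to Europe as a share of total, 1935: {relation:.1f} per cent")
```

The computed values, somewhat under 42 per cent and about 45 per cent, agree with the approximate readings obtained graphically.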

Scales of the type illustrated above may be readily con- 
structed on a given chart by using the ratio ruling for the 
scale intervals. When a series of charts is prepared on 
semi-logarithmic paper of a standard type it is convenient
to construct such scales in a more permanent form, in the 
shape of special rulers. 

Fig. 17. — Production of Rayon Filament Yarn in the United States,
1912-1937, With Lines Defining Uniform Rates of Growth

A ratio chart is particularly useful when interest attaches
to rates of growth (or decline) over a considerable period of
time. In such a case, the reading of the chart is facilitated
by the plotting of straight diagonal lines indicating uniform
rates of change. These should radiate from a single point of
origin. The procedure is illustrated in Fig. 17. The diagonal
lines there shown indicate changes at uniform rates ranging
from 10 per cent to 50 per cent per year. By reference to
these lines the user of the chart may readily determine the
approximate rate of growth of the plotted series between
any two years.
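
The guide lines and the rate they help to read off may both be computed. The Python sketch below generates the values lying on a line of uniform growth and finds the uniform annual rate that would carry a series from one value to another over a given number of years. The function names and the starting value of 100 are hypothetical, chosen only for the illustration; no figures for rayon production are reproduced here.

```python
import math


def guide_line(start_value, rate, years):
    """Values on a line of uniform growth at `rate` per year."""
    return [start_value * (1 + rate) ** t for t in range(years + 1)]


def average_growth_rate(first, last, years):
    """Uniform annual rate carrying `first` to `last` in `years` years."""
    return (last / first) ** (1 / years) - 1


# Lines radiating from a single point of origin (value 100, hypothetical).
lines = {rate: guide_line(100, rate, 5) for rate in (0.10, 0.20, 0.50)}

# On the ratio ruling each line rises by log10(1 + rate) per year.
slopes = {rate: math.log10(1 + rate) for rate in lines}

recovered = average_growth_rate(lines[0.20][0], lines[0.20][-1], 5)
print("yearly rise on the log scale:", slopes)
print("rate recovered from the 20 per cent line:", round(recovered, 4))
```

Because the yearly rise on the logarithmic scale depends only on the rate, each guide line is straight, and a plotted series running parallel to one of them is growing at that rate.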

The chief advantages of the semi-logarithmic ruling in 
chart construction may be briefly summarized: 

1. A curve of the exponential type becomes a straight line when 
   plotted on a semi-logarithmic chart. For example, a curve 
   representing the growth of any sum of money at compound 
   interest takes the form of a straight line when so plotted. 

2. In any series, so long as the rate of increase or decrease remains 
   constant the graph will be a straight line on this ruling. 

3. Equal relative changes are represented by lines having equal 
   slopes. Thus two series increasing or decreasing at equal 
   rates will be represented by parallel lines. 

4. Comparison of the rates of change in two or more series is 
   effected by comparison of the slopes of the plotted lines. 

5. The semi-logarithmic ruling permits, at the same time, the 
   plotting of absolute magnitudes and the comparison of 
   relative changes. 

6. Comparison of series differing materially in the magnitude of 
   individual items is possible with the semi-logarithmic chart. 

7. Percentages of change may be read and percentage relations 
   between magnitudes determined directly from the chart. 
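
The first three points in this summary may be checked numerically. The sketch below is a minimal illustration under assumed figures (a 6 per cent rate and two arbitrary principals, none of them from the text): it shows that a sum growing at compound interest rises by equal vertical steps on a logarithmic scale, and that two series growing at the same rate have equal steps whatever their absolute size.

```python
import math


def compound_series(principal, rate, periods):
    """Value of a sum at compound interest at the end of each period."""
    return [principal * (1.0 + rate) ** t for t in range(periods + 1)]


def log_steps(series):
    """Vertical step on a logarithmic scale between successive items."""
    return [math.log10(b) - math.log10(a) for a, b in zip(series, series[1:])]


# Two hypothetical sums of very different size, both growing at 6 per cent.
small = compound_series(100.0, 0.06, 10)
large = compound_series(25_000.0, 0.06, 10)

steps_small = log_steps(small)
steps_large = log_steps(large)

# Every step equals log10(1.06): a straight line, and parallel for both series.
expected = math.log10(1.06)
print(all(math.isclose(s, expected, abs_tol=1e-9) for s in steps_small + steps_large))
```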

CHARTS FOR THE COMPARISON OF FREQUENCIES 

A different type of chart is called for when the object is 
the comparison of frequencies, that is, numbers of events 
or things of different classes. The following census figures 
may serve to illustrate the problem. 

Table 3 

Farms in New England States in 1935 

State              Number of farms 

Maine                   41,907 
New Hampshire           17,695 
Vermont                 27,061 
Massachusetts           35,094 
Rhode Island             4,327 
Connecticut             32,157 

A graphic comparison of these six states with respect to num- 
ber of farms in 1935 is afforded by the bar diagram in Fig. 18. 
This is a simple but effective type of chart for this purpose. 

Further examples of this type of chart, as employed in 
the representation of frequency distributions, are contained 
in the next chapter. It is there shown how a frequency 
polygon or frequency curve may grow out of the simple 
bar diagram, when data of certain kinds are being handled. 
Such frequency curves constitute very important graphic 
types, but it will be more appropriate to treat them in 
full at a later point. 

Fig. 18. Farms in New England States in 1935 

CHARTS FOR THE REPRESENTATION OF COMPONENT PARTS 

It is frequently desirable in tabular and graphic presenta- 
tion to break up a total into its component parts, in order 
that changes in the parts as well as in the total may be fol- 
lowed. The table on page 43 exemplifies this procedure. 

Table 4 

Total Value of Products and Elements of Production Costs, 
Manufacturing Industries of the United States, 
1919-1935 

(Millions of dollars) 

Year    Cost of        Labor cost    Overhead cost     Total value 
        materials ¹    (wages)       plus profits ²    of products 

1919    $37,233        $10,462       $14,347           $62,042 
1921     25,321          8,202        10,130            43,653 
1923     34,706         11,009        14,841            60,556 
1925     35,936         10,730        16,048            62,714 
1927     34,803         10,836        16,639            62,278 
1929     38,178         11,607        20,176            69,961 
1931     21,681          7,173        12,181            41,038 
1933     16,821          5,262         9,276            31,359 
1935     26,264          7,545        11,951            45,760 

¹ Including containers, fuel and purchased electric energy. 

² This item represents the difference between total direct costs (materials 
and wages) and total value of products. It includes overhead costs proper, 
plus salaries, taxes, profits, etc. 

These figures are presented graphically in Fig. 19, which 
reveals the varying post-war fortunes of different interests 
in American manufacturing industries. It is clear from the 
diagram that the general swings of material costs, labor costs 
and overhead costs in American manufacturing industries 
have paralleled the fluctuations in total value of products. 
Some of the movements of the component items are of ex- 
ceptional interest, however. Overhead costs (with which 
profits are here combined) showed a notable expansion be- 
tween 1921 and 1929. The great recession that followed 
squeezed all the elements of the total, forcing them to levels 
well below those of the 1921 depression. 

Fig. 19. Total Value of Products and Elements of Production Costs, 
Manufacturing Industries of the United States, 1919-1935 
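
The movements described in this paragraph can be verified from the figures of Table 4. The following Python sketch copies the rows for 1921, 1929 and 1933 from the table; the function and variable names are assumptions made for illustration.

```python
# Table 4, millions of dollars: year -> (materials, wages, overhead plus profits)
COSTS = {
    1921: (25_321, 8_202, 10_130),
    1929: (38_178, 11_607, 20_176),
    1933: (16_821, 5_262, 9_276),
}
LABELS = ("materials", "wages", "overhead plus profits")


def percent_change(old, new):
    """Percentage change from old to new."""
    return 100.0 * (new - old) / old


def changes(year_from, year_to):
    """Percentage change of each cost element between two census years."""
    return {
        label: percent_change(a, b)
        for label, a, b in zip(LABELS, COSTS[year_from], COSTS[year_to])
    }


expansion = changes(1921, 1929)   # the expansion of the nineteen-twenties
recession = changes(1929, 1933)   # the recession that followed

# True when every element stood lower in 1933 than in 1921.
below_1921 = all(v33 < v21 for v21, v33 in zip(COSTS[1921], COSTS[1933]))

for label in LABELS:
    print(f"{label:22s} 1921-29: {expansion[label]:+6.1f}%   1929-33: {recession[label]:+6.1f}%")
print("all elements below 1921 levels in 1933:", below_1921)
```

Overhead costs plus profits very nearly doubled between 1921 and 1929, a much larger relative gain than that of materials or wages.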


CUMULATIVE CHARTS 

In many cases chief interest in the development of a 
series attaches not to the value of each successive item but 
to the cumulated total of a number of such items. This 
may be so when a yearly production program has been 
laid out. In such a case it is the relation between cumu- 
lated production to date and scheduled production to date 
which is of major interest, and a chart form is needed 
which will enable this comparison to be made. The fol- 
lowing figures illustrate the type of data for which such 
charts are appropriate. 

Table 5 

Cumulative Production Schedule and Cumulative 
Output, 1936 

(Speedwell Automobile Company) 

Month        Production    Cumulative    Output     Cumulative 
             schedule      production    (cars)     output 
             (cars)        schedule                 (cars) 
                           (cars) 

January         8,000         8,000       6,125        6,125 
February       10,000        18,000       9,250       15,375 
March          12,000        30,000      10,514       25,889 
April          15,000        45,000      15,131       41,020 
May            14,000        59,000      12,159       53,179 
June           12,000        71,000      13,250       66,429 
July           11,000        82,000      11,462       77,891 
August         10,000        92,000      10,531       88,422 
September       6,000        98,000       4,621       93,043 
October         9,000       107,000       9,843      102,886 
November       10,000       117,000      13,785      116,671 
December       10,000       127,000 


It is assumed that this table represents the situation as 
of the end of November. 

In Fig. 20 the two cumulative curves are plotted. The 
relation between actual and scheduled production at the 
end of each month is shown on the chart, and it is possible 
from the scale to read the approximate amount by which 
production is behind schedule. By reference to the figures, 
which should always accompany the chart, the exact rela- 
tion may be determined. Such a chart has many ap- 
plications, some of which are illustrated in the following 
chapter. 

Fig. 20. Comparison of Scheduled and Actual Output (Cumulative), 
Speedwell Automobile Co., 1936 

THE GANTT PROGRESS CHART 

The same data may be presented in a very effective form 
by making use of a type of chart developed by Mr. H. L. 
Gantt. An adequate description of this chart and of its 
many uses would far exceed the space which can be given 
to it here, but its characteristics may be indicated in a very 
brief account. 

Once a schedule has been drawn up, the Gantt chart 
may be utilized in checking actual accomplishment against 
the schedule. Having such a schedule as that given in 
Table 5, the monthly and annual quotas may be entered 
on a form similar to that shown in Fig. 21. The entry to 
the left of each monthly space indicates the amount sched- 
uled for production during that month. The entry to the 
right of each monthly space indicates the cumulated sched- 
uled production to the end of the given month. In this 
figure the results of the first two months’ operations are 
shown. The heavy black line indicates the cumulated 
actual production during this period, amounting to 15,375 
cars. The narrow upper lines in the January and February 
columns measure the actual production in each of those 
months. If actual production in either month had equaled 
the scheduled production the light line would extend across 
the full monthly space. When actual production in a given 
month exceeds the scheduled production a double light line 
appears. 

It should be noted that the spaces into which each monthly 
period is divided represent equal time intervals but varying 
amounts in terms of actual production. Thus the space 
representing one fifth of the January interval represents a 
production of 1,600 cars (the January quota being 8,000). 
The space representing one fifth of the April interval repre- 
sents 3,000 cars (the April quota being 15,000). In reading 
the chart in terms of absolute magnitudes reference must be 
had to the monthly quotas. 

The situation at the end of November is shown in Fig. 22. 
The arrow at the top of the diagram indicates the point of 
time actually reached. That actual production is slightly 
behind scheduled production is apparent from the relation 
between this arrow and the heavy black line, while the light 
lines indicating monthly production show that actual output 
has exceeded the monthly quota in five of the last six 
months. 
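
The readings taken from the cumulative chart and from the Gantt chart can be reproduced from the monthly figures of Table 5. The sketch below is offered as an illustration, not as a description of how the charts were drawn; the names `heavy_line_position` and `monthly_fill` are assumptions. It cumulates schedule and output, finds the amount by which production is behind schedule, and computes how far the light and heavy lines of the Gantt chart extend.

```python
from itertools import accumulate

# Table 5: monthly schedule for the year, and output through November (cars).
SCHEDULE = [8000, 10000, 12000, 15000, 14000, 12000,
            11000, 10000, 6000, 9000, 10000, 10000]
OUTPUT = [6125, 9250, 10514, 15131, 12159, 13250,
          11462, 10531, 4621, 9843, 13785]

cum_schedule = list(accumulate(SCHEDULE))
cum_output = list(accumulate(OUTPUT))

# Cars behind schedule at the end of each month elapsed (zip stops at November).
shortfall = [s - o for s, o in zip(cum_schedule, cum_output)]

# Light line: share of each monthly space filled (above 1.0 means quota exceeded).
pairs = list(zip(SCHEDULE, OUTPUT))
monthly_fill = [o / s for s, o in pairs]
exceeded_last_six = sum(o > s for s, o in pairs[-6:])


def heavy_line_position(total_output, schedule):
    """Return (monthly spaces fully covered, fraction of the next space covered)."""
    months_done = 0
    remaining = total_output
    for quota in schedule:
        if remaining < quota:
            return months_done, remaining / quota
        remaining -= quota
        months_done += 1
    return months_done, 0.0


print("behind schedule at end of November:", shortfall[-1], "cars")
print("January and February light lines:", monthly_fill[:2])
print("months of last six with quota exceeded:", exceeded_last_six)
print("heavy line after two months:", heavy_line_position(cum_output[1], SCHEDULE))
print("heavy line at end of November:", heavy_line_position(cum_output[-1], SCHEDULE))
```

Because each monthly space stands for that month's quota, the heavy line is placed by exhausting the quotas in turn rather than by a single division.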

The Gantt chart has a great variety of applications in 
governmental and business organizations. The economy 
of space is such that developments in a number of depart- 
ments or districts may be shown on a single chart. It 
constitutes the simplest and most effective graphic method 
known for following the progress of work under way, for 
comparing actual accomplishment with an established pro- 
gram. And in so doing, it increases by so much the efficiency 
of administrative control. 

Preferred Practice for Graphic Presentation 

Graphic methods have been widely employed in the physi- 
cal and social sciences and in business, and the resulting 
diversity of uses has made it difficult to secure standardiza- 
tion of practice. To remedy this defect a committee repre- 
senting engineering, statistical and research organizations 
was organized in 1929, under the sponsorship of the American 
Society of Mechanical Engineers, for the purpose of formulat- 
ing principles of preferred practice in this field. This group, 
acting as a Sectional Committee of the American Standards 
Association, is compiling a code of preferred practice for 
graphic presentation. The first section of this code, dealing 
with time series charts, has been issued by the sectional com- 
mittee. This report furnishes an excellent summary of con- 
ventional procedures, with detailed recommendations con- 
cerning principles appropriate to the graphic presentation of 
time series. Somewhat more specialized, although it deals 
with certain principles applicable to the entire field of 
graphics, is another report of the same committee on charts 
suitable for use as lantern slides. 

THE ORGANIZATION OF STATISTICAL DATA: 

THE FREQUENCY DISTRIBUTION 

The task of the statistician engaged in business or eco- 
nomic research includes the organization, analysis and in- 
terpretation of quantitative data relating to business affairs 
and to economic conditions. To these fundamental opera- 
tions that of collecting the original data may be added, 
though more frequently data will be compiled directly from 
primary or secondary sources. 

At the outset it is necessary to distinguish between the 
problems arising in the analysis of time series and those 
involved in the organization and analysis of materials in 
connection with which the time factor does not enter. In 
studying a time series the primary object is to measure and 
analyze the chronological variations in the value of the 
variable. Thus one may study variations in sales over a 
period of years, fluctuations in the production of bituminous 
coal, or changes in the general level of prices. Quite differ- 
ent is the procedure in the study of such a problem as 
income distribution at a given time. In this case we are 
desirous of knowing how many people in the United States 
fall in each of a number of income classes. The general 
problem of organization in this latter class of cases is to 
determine how many times each value of a variable is re- 
peated and how these values are distributed. Data of this 
sort, when organized, constitute a frequency series, as opposed 
to the time or historical series. The methods appropriate 
to these two types of analysis differ fundamentally and will 
therefore be treated separately. In the present section we 
are concerned with the organization and preliminary analy- 
sis of data in connection with which the time element, 
while it may be present, does not enter as a factor. 


Unorganized Data 

When quantitative data of the type with which the 
statistician works are presented in a raw state they appear 
as unorganized masses of material, without form or struc- 
ture. They may have been drawn from the production or 
sales records of a business establishment, or they may 
represent a miscellaneous collection of price quotations. If 
the data have been gathered by other agencies they may 
already have been arranged in the form of a general table, 
but this form may be entirely unsuited to the particular 
object in the mind of the investigator. The first task of 
the statistician is the organization of the figures in such 
a form that their significance, for the purpose in hand, may 
be appreciated, that comparison with masses of similar 
data may be facilitated, and that further analysis may be 
possible. Scientific method, it has been noted, involves 
observation, inference, and verification. Data, the results of 
observation, must be put into definite form and given 
coherent structure before the process of inference is possi- 
ble. 

The figures on page 52, representing the earnings during 
a given week of 210 individuals engaged in piece work in a 
certain manufacturing establishment, will serve as an ex- 
ample of such data in their raw state. 

Weekly Earnings of 210 Employees 

$26.25  $28.70  $24.15  $29.75  $29.20  $30.60  $23.40  $24.75  $26.70  $24.35 
 25.75   27.20   28.30   25.25   27.75   27.60   28.20   27.30   27.80   26.35 
 27.40   28.30   26.60   25.75   27.70   28.60   25.30   27.80   26.40   27.30 
 28.35   27.00   24.30   27.80   27.60   26.30   27.40   23.50   29.60   27.80 
 27.60   25.35   27.55   29.00   24.10   27.00   24.50   27.25   26.15   29.30 
 23.10   27.10   28.50   27.45   26.15   28.35   27.95   25.55   27.55   26.60 
 24.25   30.00   28.55   28.00   27.30   27.90   25.25   24.10   27.45   24.55 
 26.55   27.55   26.75   31.00   24.00   25.35   26.50   28.30   27.95   25.55 
 30.25   28.55   26.75   24.60   25.75   26.55   27.80   28.90   29.55   30.00 
 24.60   25.75   26.30   27.00   28.25   25.25   25.75   26.25   26.30   26.75 
 27.90   28.30   25.70   26.30   26.60   27.00   30.75   28.60   28.10   23.50 
 24.75   25.15   26.30   27.25   28.15   29.10   30.10   29.90   28.55   27.30 
 26.55   27.55   23.00   24.50   22.85   26.55   27.55   28.10   30.70   28.60 
 27.90   26.80   24.10   25.25   26.30   27.90   26.90   25.30   25.80   28.85 
 27.55   27.30   25.00   26.00   26.55   27.80   28.60   30.55   29.50   24.10 
 25.15   27.15   28.10   26.30   27.10   24.60   27.80   26.30   27.90   29.80 
 24.10   25.15   27.50   24.25   25.70   26.80   30.15   29.30   28.15   28.65 
 24.55   25.85   26.10   27.00   26.80   27.55   29.00   23.00   28.60   29.30 
 28.55   28.80   27.55   23.60   26.10   27.15   25.75   26.80   27.15   26.30 
 28.55   25.80   24.55   25.80   26.75   27.30   27.55   28.25   25.60   26.30 
 26.85   28.60   27.30   26.00   28.10   32.00   28.15   26.30   27.75   26.25 

The Array 

If these figures are arranged in order of magnitude some- 
thing will have been done toward securing a coherent 
structure. The range covered and the general distribution 
throughout this range will then be clear, and the way will 
be prepared for further organization. When so arranged 
the array on page 53 is secured. 

Array: Weekly Earnings of 210 Employees 

$22.85  $25.15  $26.15  $26.75  $27.45  $27.95  $28.60 
 23.00   25.15   26.15   26.75   27.45   27.95   28.65 
 23.00   25.15   26.25   26.80   27.50   28.00   28.70 
 23.10   25.25   26.25   26.80   27.55   28.10   28.80 
 23.40   25.25   26.25   26.80   27.55   28.10   28.85 
 23.50   25.25   26.30   26.80   27.55   28.10   28.90 
 23.50   25.25   26.30   26.85   27.55   28.10   29.00 
 23.60   25.30   26.30   26.90   27.55   28.15   29.00 
 24.00   25.30   26.30   27.00   27.55   28.15   29.10 
 24.10   25.35   26.30   27.00   27.55   28.15   29.20 
 24.10   25.35   26.30   27.00   27.55   28.20   29.30 
 24.10   25.55   26.30   27.00   27.55   28.25   29.30 
 24.10   25.55   26.30   27.00   27.60   28.25   29.30 
 24.10   25.60   26.30   27.10   27.60   28.30   29.50 
 24.15   25.70   26.30   27.10   27.60   28.30   29.55 
 24.25   25.70   26.30   27.15   27.70   28.30   29.60 
 24.25   25.75   26.35   27.15   27.75   28.30   29.75 
 24.30   25.75   26.40   27.15   27.75   28.35   29.80 
 24.35   25.75   26.50   27.20   27.80   28.35   29.90 
 24.50   25.75   26.55   27.25   27.80   28.50   30.00 
 24.50   25.75   26.55   27.25   27.80   28.55   30.00 
 24.55   25.75   26.55   27.30   27.80   28.55   30.10 
 24.55   25.80   26.55   27.30   27.80   28.55   30.15 
 24.55   25.80   26.55   27.30   27.80   28.55   30.25 
 24.60   25.80   26.60   27.30   27.80   28.55   30.55 
 24.60   25.85   26.60   27.30   27.90   28.60   30.60 
 24.60   26.00   26.60   27.30   27.90   28.60   30.70 
 24.75   26.00   26.70   27.30   27.90   28.60   30.75 
 24.75   26.10   26.75   27.40   27.90   28.60   31.00 
 25.00   26.10   26.75   27.40   27.90   28.60   32.00 

Frequency Tables 

While the array presents the figures in a shape much 
more suitable for study than the haphazard distribution 
first shown, there is still something to be desired before the 
mind can readily grasp the full significance of the data. 
The factory manager may see that the smallest amount 
earned during the week was $22.85, that the largest amount 
earned was $32.00, and that most of the employees earned 
between $25.00 and $29.00, but this is still a vague descrip- 
tion of the data. By a process of grouping, that is, by 
putting into common classes all individuals whose earnings 
fall within certain limits, a simplified and more compact 

presentation of the wage distribution may be obtained. The 
following table shows the results of this grouping process 
when the range of each class (the class-interval) is two 
dollars. 

Table 6 

Frequency Distribution of Employees 
(Classified on the basis of weekly earnings [class-interval = $2]) 

Weekly earnings        Number earning stated amount 
                       (frequency) 

$22.00 to $23.99              8 
 24.00 to  25.99             48 
 26.00 to  27.99             96 
 28.00 to  29.99             47 
 30.00 to  31.99             10 
 32.00 to  33.99              1 
                            210 

This table presents a condensed summary of the original 
figures, a summary which not only gives us the approximate 
range of the earnings, but shows, also, how the earnings of 
the 210 workers are distributed throughout this range. 
There has been a considerable loss of detail, it will be 
noted. From this table we may learn that there are 48 persons who 
earned during the given week between $24.00 and $25.99, 
but we cannot learn how the earnings of the 48 individuals 
were distributed throughout this range of two dollars. All 
may have earned exactly $24.00, so far as we may know 
from the figures shown in the table. This loss of detail is 
an inevitable accompaniment of the condensation and sim- 
plification which the process of classification involves. 

If the size of the class-interval be decreased the loss of 
detail is less pronounced, though the increase in the number 
of classes means a more cumbersome table and one which 
presents a more complex picture to the eye. The tables 
on page 55 present the same data, classified with intervals 
of one dollar, fifty cents, and twenty-five cents. 

Frequency Distributions of Employees 
(Classified on the basis of weekly earnings) 

Table 7 
(Class-interval = $1) 

Weekly earnings        Frequency 

$22.00 to $22.99            1 
 23.00 to  23.99            7 
 24.00 to  24.99           21 
 25.00 to  25.99           27 
 26.00 to  26.99           42 
 27.00 to  27.99           54 
 28.00 to  28.99           34 
 29.00 to  29.99           13 
 30.00 to  30.99            9 
 31.00 to  31.99            1 
 32.00 to  32.99            1 
                          210 

Table 8 
(Class-interval = 50 cents) 

Weekly earnings        Frequency 

$22.50 to $22.99            1 
 23.00 to  23.49            4 
 23.50 to  23.99            3 
 24.00 to  24.49           11 
 24.50 to  24.99           10 
 25.00 to  25.49           12 
 25.50 to  25.99           15 
 26.00 to  26.49           22 
 26.50 to  26.99           20 
 27.00 to  27.49           24 
 27.50 to  27.99           30 
 28.00 to  28.49           17 
 28.50 to  28.99           17 
 29.00 to  29.49            7 
 29.50 to  29.99            6 
 30.00 to  30.49            5 
 30.50 to  30.99            4 
 31.00 to  31.49            1 
 31.50 to  31.99            0 
 32.00 to  32.49            1 
                          210 

Table 9 
(Class-interval = 25 cents) 

Weekly earnings        Frequency 

$22.75 to $22.99            1 
 23.00 to  23.24            3 
 23.25 to  23.49            1 
 23.50 to  23.74            3 
 23.75 to  23.99            0 
 24.00 to  24.24            7 
 24.25 to  24.49            4 
 24.50 to  24.74            8 
 24.75 to  24.99            2 
 25.00 to  25.24            4 
 25.25 to  25.49            8 
 25.50 to  25.74            5 
 25.75 to  25.99           10 
 26.00 to  26.24            6 
 26.25 to  26.49           16 
 26.50 to  26.74           10 
 26.75 to  26.99           10 
 27.00 to  27.24           11 
 27.25 to  27.49           13 
 27.50 to  27.74           14 
 27.75 to  27.99           16 
 28.00 to  28.24            9 
 28.25 to  28.49            8 
 28.50 to  28.74           14 
 28.75 to  28.99            3 
 29.00 to  29.24            4 
 29.25 to  29.49            3 
 29.50 to  29.74            3 
 29.75 to  29.99            3 
 30.00 to  30.24            4 
 30.25 to  30.49            1 
 30.50 to  30.74            3 
 30.75 to  30.99            1 
 31.00 to  31.24            1 
 31.25 to  31.49            0 
 31.50 to  31.74            0 
 31.75 to  31.99            0 
 32.00 to  32.24            1 

The four tables we have thus constructed represent four 
different degrees of condensation of the same data. Tables 6, 
7, and 8 present the same general characteristics: a small 
number of cases in the extreme classes and a more or less 
regular increase in the frequencies as the center of each of 
the distributions is approached. The departure from reg- 
ularity becomes greater the greater the number of classes. 
Table 9, in which the class-interval is 25 cents, has 38 classes. 
In this table the distribution of cases throughout the range 
is highly irregular, with pronounced departures from sym- 
metry. The structure of each of the other tables is orderly 
and approaches more closely a condition of symmetry. Each 
presents the wage data in condensed and compact form, so 
that one consulting the tables may learn of the size and 
distribution of weekly earnings in the factory in question 
much more readily than by reference to the chaotic collec- 
tion of figures first shown. Such organized collections of 
data are termed frequency distributions, and their purpose, 
as the term implies, is to show in a condensed form the na- 
ture of the distribution of a variable quantity throughout 
the range covered by the values of the variable. The con- 
struction of such a table is the first step to be taken in the 
organization and analysis of quantitative data of the type 
represented above. 
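
The grouping process described above may be sketched in a few lines of Python. The sketch is illustrative and makes two assumptions: amounts are converted to whole cents so that class limits are compared exactly, and the function names `tally` and `merge_classes` are invented for the purpose. It tallies a handful of the earnings figures into two-dollar classes and shows that the one-dollar classes of Table 7, combined in pairs, give the frequencies of Table 6.

```python
def tally(values, lower, width, n_classes):
    """Count the values falling in each of n_classes classes of equal width.

    Amounts are converted to whole cents so that limits such as
    $23.99 and $24.00 are compared exactly.
    """
    lower_c, width_c = round(lower * 100), round(width * 100)
    counts = [0] * n_classes
    for v in values:
        index = (round(v * 100) - lower_c) // width_c
        if not 0 <= index < n_classes:
            raise ValueError(f"{v} lies outside the table")
        counts[index] += 1
    return counts


def merge_classes(frequencies, group):
    """Condense a table by combining each run of `group` adjacent classes."""
    return [sum(frequencies[k:k + group]) for k in range(0, len(frequencies), group)]


# The first eight of the 210 earnings figures, used as a small illustration.
sample = [26.25, 28.70, 24.15, 29.75, 29.20, 30.60, 23.40, 24.75]
sample_counts = tally(sample, lower=22.00, width=2.00, n_classes=6)
print(sample_counts)

# Table 7 ($1 classes beginning at $22.00) condensed to $2 classes.
TABLE_7 = [1, 7, 21, 27, 42, 54, 34, 13, 9, 1, 1]
table_6 = merge_classes(TABLE_7, 2)
print(table_6, sum(table_6))
```

Condensation of this kind can always be carried out from a finer table to a coarser one, but not in the reverse direction, which is the loss of detail noted in the text.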

STEPS IN THE CONSTRUCTION OF A FREQUENCY TABLE 

This general introduction to the subject of frequency 
tables has left untouched many important matters in con- 
nection with their construction. It remains to present a 
summary statement of these details. It will be clear that 
the first step here taken, the arrangement of the items in 
order of magnitude, is unnecessary in the actual construc- 
tion of such a table. Having determined the upper and 
lower limits through an inspection of the data, one has 
but to decide on the number of classes desired, write the 
class-intervals on an appropriate blank sheet, and proceed 
to tally the cases falling in each of the classes thus set off. 
When this process is completed the frequencies are com- 
puted and the totals arranged in tabular form of the type 
illustrated above. These simple operations involve decisions 
on a number of points, however. 

SIZE OF CLASS-INTERVAL 

In deciding upon the size of the class-interval (which is 
equivalent to deciding upon the number of classes) one 
fundamental consideration should be borne in mind, namely, 
that classes should be so arranged that there will be no 
material departure from an even distribution of cases within 
each class. This arrangement is necessary because, in inter- 
preting the frequency table and in subsequent calculations 
based upon it, the mid-value of each class is taken to repre- 
sent the values of all cases falling in that class. Thus, in 
basing calculations upon Table 8, it is assumed that the 
22 cases falling between $26.00 and $26.50 may all be repre- 
sented by the mid-value of that class, $26.25. This assump- 
tion will seldom be strictly valid. In the case just cited 
reference to the original figures will show that it is not a 
correct assumption. Absolute accuracy would only be ob- 
tained by having a class for every value represented in the 
original figures. Since condensation is necessary an arrange- 
ment of classes should be secured which will minimize the 
error involved, without transgressing other requirements. 
Table 6 furnishes an example of class-intervals too wide 
for the material. 
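
The case cited from Table 8 may be examined directly. In the sketch below the 22 items of the class are read from the array of weekly earnings and expressed in cents; the variable names are assumptions made for illustration. The computation compares the actual total of the class with the total implied by placing every case at the mid-value of $26.25.

```python
# The 22 earnings in the class $26.00 to $26.49, read from the array,
# as (amount in cents, number of employees earning that amount).
CLASS_ITEMS = [(2600, 2), (2610, 2), (2615, 2), (2625, 3),
               (2630, 11), (2635, 1), (2640, 1)]
MID_VALUE = 2625  # the mid-value used in the text, in cents

count = sum(n for _, n in CLASS_ITEMS)
actual_total = sum(amount * n for amount, n in CLASS_ITEMS)
assumed_total = MID_VALUE * count
actual_mean = actual_total / count
at_mid_value = dict(CLASS_ITEMS).get(MID_VALUE, 0)

print("cases in class:", count)
print("cases exactly at the mid-value:", at_mid_value)
print("actual mean (cents):", round(actual_mean, 2))
print("error of the assumed total (cents):", assumed_total - actual_total)
```

Only three of the 22 cases lie at the mid-value and half of them lie at $26.30, so the assumption is not literally correct; in this class the errors happen largely to offset one another, which need not be so in a class that is too wide.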

The requirement which has just been described clearly 
calls for a large number of classes. A second requirement, 
which ordinarily conflicts with this, is that the number of 
classes should be so determined that an orderly and regular 
sequence of frequencies is secured. If the classification is 
too narrow for the data regularity will not be attained in 
this respect, and a table without structure or order will be 
secured. Table 9 fails to meet this requirement, as has been 
pointed out. It is desirable, also, that the number of classes 
be limited in order that the data may be easily manipulated 
and their significance readily grasped. 

A useful procedure for approximating a suitable class- 
interval has been suggested by H. A. Sturges. Given a 
series of N items of known range, a suitable class-interval i 
may be approximated from the formula 

$$ i = \frac{\text{Range}}{1 + 3.322 \log N} $$

The specific figure secured in a given instance is likely to 
be a fractional value, quite unsuited to actual use. An 
appropriate round number close to the theoretical value 
may be chosen.¹ Thus, in the example cited above, with a 
range of $9.15 and N equal to 210, the use of a class-interval 
of $1.05 is indicated by the formula. The nearest round 
number, suitable with reference to other considerations as 
well, is $1.00. Table 7, in which this class-interval is em- 
ployed, seems to conform most thoroughly to all the re- 
quirements we have set forth. 

¹ This formula, and the justification for its use, are discussed in “The Choice 
of a Class Interval” by Herbert A. Sturges, Journal of the American Statistical 
Association, March, 1926, 65-6. The use of the formula rests on the assumption 
that the proper distribution into classes is given, for all numbers which are 
powers of 2, by a series of binomial coefficients. The relation of the terms in 
the binomial expansion to the theory of frequency distributions is discussed 
below, in Chapter XIII. 

LOCATION OF CLASS LIMITS 

The location of class limits is a matter of considerable 
importance, for attention to this matter will simplify tab- 
ulation and facilitate later calculation. Tabulation of data 
is easiest when class limits are integers and the class-interval 
itself is a whole number. Calculation of averages and other 
statistical measures is facilitated when the mid-values of 
classes are integers. Suitable class limits and mid-points 
are usually secured when the data permit class-intervals of 5 
or multiples of 5 to be employed, though such an arrange- 
ment is by no means essential. 

Some types of data show a tendency to cluster or con- 
centrate about certain values on the scale along which they 
are distributed. This is illustrated by the following figures 
which form part of a table showing the number of pieces 
of commercial paper discounted by the Federal Reserve 
Banks in 1921, distributed according to rates of discount 
or interest charged by member banks: 

Rate (per cent)        Number of pieces 

6                          18,970 
6¼                            697 
6½                          4,616 
6¾                            135 
7                          17,362 
7¼                             10 

Here is a quite obvious bunching about the integers, with 
a secondary concentration at each half of one per cent. 
No cases at all fall between the quarter values here shown. 
It is clear that in classifying such data the mid-points of 
the various classes should fall at those values about which 
the cases are concentrated, and class limits must be located 
with this end in view. For, as noted above, calculations 
based upon the frequency table are performed upon the 
assumption that all the items in each class are concentrated 
at the mid-point of that class. Thus, if a class interval of 
one half of one per cent were selected in the above example, 
the classes should extend from 5¾ to (but not including) 
6¼, 6¼ to 6¾, etc., rather than from 6 to 6½, 6½ to 7, etc. 
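
Both the approximation suggested by Sturges and the effect of locating class limits about the points of concentration may be checked by a short computation. The following Python sketch is illustrative only. It assumes that the logarithm in the formula is the common logarithm (3.322 being very nearly the reciprocal of the common logarithm of 2), it uses only the six rates quoted in the partial table above, and the function names are invented for the purpose.

```python
import math
from fractions import Fraction as F


def sturges_interval(data_range, n_items):
    """Class-interval suggested by the formula: range / (1 + 3.322 log10 N)."""
    return data_range / (1 + 3.322 * math.log10(n_items))


# Weekly earnings example: range $32.00 - $22.85, N = 210.
interval = sturges_interval(32.00 - 22.85, 210)
print(f"suggested class-interval: ${interval:.2f}")

# Part of the discount-rate table: rate (per cent) -> number of pieces.
PIECES = {F(6): 18970, F(25, 4): 697, F(13, 2): 4616,
          F(27, 4): 135, F(7): 17362, F(29, 4): 10}
TOTAL = sum(PIECES.values())


def grouped_mean(first_lower, width):
    """Mean rate computed as if every item lay at the mid-point of its class."""
    weighted = F(0)
    for rate, n in PIECES.items():
        k = (rate - first_lower) // width          # index of the class holding this rate
        mid = first_lower + k * width + width / 2
        weighted += mid * n
    return weighted / TOTAL


actual_mean = sum(rate * n for rate, n in PIECES.items()) / TOTAL
centered = grouped_mean(F(23, 4), F(1, 2))    # classes 5 3/4 to 6 1/4, 6 1/4 to 6 3/4, ...
off_center = grouped_mean(F(6), F(1, 2))      # classes 6 to 6 1/2, 6 1/2 to 7, ...

print(f"actual mean rate:         {float(actual_mean):.4f}")
print(f"mid-points at 6, 6 1/2:   {float(centered):.4f}")
print(f"mid-points at 6 1/4, ...: {float(off_center):.4f}")
```

For these figures the mean computed from classes centered on the points of concentration lies much closer to the mean of the ungrouped rates than the mean computed from classes beginning at the integers.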


ACCURACY OF OBSERVATIONS AND THE DEFINITION 
OF CLASSES 

In the construction of frequency tables it is essential 
that there be a clear definition of classes, so that there may 
be no uncertainty as to their range and no question as to 
the precise class in which a given case falls. A table with 
an arrangement similar to the following is sometimes en- 
countered: 


Class-interval        Frequency 

 0 to 10                  3 
10 to 20                  8 
20 to 30                 16 
30 to 40                  6 
40 to 50                  2 

In the absence of explanation, a question arises at once as 
to whether a case with a value of 10 would fall in the first 
or in the second class. It is highly desirable that the range 
of each class be indicated in some such way as the following, 
in order that this ambiguity may be avoided: 

Class-interval        Frequency 

 0 to  9.9                3 
10 to 19.9                8 
20 to 29.9               15 
30 to 39.9                6 
40 to 49.9                2 


This procedure solves the difficulty, however, only in case 
the observations are accurate to the nearest tenth. If the 
observations are accurate only to the nearest unit (that is, 
if the cases recorded as having a value of 10 actually lie 
between 9.5 and 10.5) a mere change in the description of 
the class range does not solve the problem of allocating a 
case at the class limit. In such a case an observation falling 
at a class boundary may be cut in two, one half being allo- 
cated to each of the adjacent classes. 
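
The device of dividing a boundary case between adjacent classes may be sketched as follows. The values tallied are hypothetical, and the function name is an assumption; the sketch merely shows the bookkeeping for observations accurate to the nearest unit, with classes bounded at 0, 10, 20, 30, 40 and 50.

```python
from bisect import bisect_right
from fractions import Fraction


def tally_with_split(values, boundaries):
    """Tally unit-accurate values into classes bounded by `boundaries`.

    A value falling exactly on an interior boundary is cut in two,
    one half being credited to each of the adjacent classes.
    """
    n_classes = len(boundaries) - 1
    counts = [Fraction(0)] * n_classes
    for v in values:
        if v < boundaries[0] or v > boundaries[-1]:
            raise ValueError(f"{v} lies outside the table")
        if v in boundaries[1:-1]:
            k = boundaries.index(v)            # boundary between class k-1 and class k
            counts[k - 1] += Fraction(1, 2)
            counts[k] += Fraction(1, 2)
        else:
            k = bisect_right(boundaries, v) - 1
            counts[min(k, n_classes - 1)] += 1
    return counts


BOUNDARIES = [0, 10, 20, 30, 40, 50]
observations = [4, 10, 10, 17, 20, 23, 38, 41]   # hypothetical values

counts = tally_with_split(observations, BOUNDARIES)
print([str(c) for c in counts])
```

Fractional frequencies are the price of this device; the total of the frequencies is unchanged.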

Yule lays down the useful principle that in fixing a class 
boundary the limit should be carried to a farther place in 
decimals, or a smaller fraction, than the values of the indi- 
vidual cases as originally recorded. Thus, in the preceding 
example, if observations were correct to the nearest tenth, 
it would mean that a value recorded as 9.9 actually lay 
between 9.85 and 9.95. In accurately describing the classes, 
therefore, the intervals should be given as 0 to 9.95, 9.95 
to 19.95, etc. (Since the observations to be tabulated are 
recorded only to the first decimal place no ambiguity arises 
from the apparent over-lapping of these class limits.) It 
should be noted that the values of the mid-points, with 
these class limits, would be 4.95, 14.95, etc. In presenting 
and using the table as given above the real meaning of the 
class limits should be borne in mind. In all cases class 
boundaries must be fixed with reference to the accuracy of 
the observations. 
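
Yule's principle can be illustrated with a short Python sketch. It is offered as an illustration only: the function name `true_class` and the use of the `decimal` module are assumptions of this example. Each stated limit is carried half a recording unit outward, so that the class stated as 10 to 19.9 becomes 9.95 to 19.95. The lower boundary of the first class then computes as -0.05, which the text writes simply as 0; the mid-point of 4.95 agrees with the text.

```python
from decimal import Decimal

def true_class(lower, upper, unit):
    """Return (lower boundary, upper boundary, mid-point) of a class.

    lower, upper: class limits as stated in the table, e.g. "10" and "19.9"
    unit: precision to which observations were recorded, e.g. "0.1"
    """
    half = Decimal(unit) / 2
    lo = Decimal(lower) - half      # carry the limit half a unit downward
    hi = Decimal(upper) + half      # and half a unit upward
    return lo, hi, (lo + hi) / 2

# Stated limits of the second table above, observations correct to one tenth
stated = [("0", "9.9"), ("10", "19.9"), ("20", "29.9"),
          ("30", "39.9"), ("40", "49.9")]
classes = [true_class(lo, hi, "0.1") for lo, hi in stated]

for lo, hi, mid in classes:
    print(f"{lo} to {hi}   mid-point {mid}")
```

Decimal arithmetic is used here so that boundaries such as 9.95 are represented exactly rather than as binary approximations.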

The work of tabulation is simplified if, in designating a 
class, both limits are stated, as above. Errors are likely if 
only the lower limit of each class is given, or if the mid-point 
alone is designated. It is desirable, however, particularly if 
calculations are to be based upon the table, to include a 
separate column showing the values of the mid-points of the 
various classes. 

OTHER REQUIREMENTS 

Class-intervals should be uniform throughout the table 
in order that all classes may be comparable. Occasionally 
tables are published with varying class-intervals, so that 
on one section of the scale the number of items falling 
within a class having an interval of 5 is given, and on 
another section of the scale the number of items falling 
within a class having a range of 10 is given. Obviously, 
comparison of classes is impossible. It may be desirable 
to show in more detail the cases falling within certain 
ranges on the scale, but this end is best achieved by the 
construction of a supplementary table relating only to the 
cases falling within this restricted section. The utility of 
the main table is not lessened thereby. 

Similar in nature is the requirement that there should be 
no indeterminate classes, that is, classes the ranges of which 
are not defined. Had all the individuals making $30.00 and 
over in the illustration of piece-work earnings been entered 
in a class with the designation “$30.00 and over,” the 
upper limit of this class would have been quite uncertain. 
This fault in a table is a vital one when it is desired to base 
calculations upon the data contained in the table. When 
there are several extreme cases the inclusion of such classes 
is sometimes unavoidable, but when this is done the actual 
values of the cases included in such “open end” classes 
should be given in a footnote to the table. 

The errors described in the two preceding paragraphs 
are exemplified in the table on page 62. 

Table 10 

Frequency Distribution of Rented Dwellings in Reno, Nevada, 1934 
(Classified on the basis of rental value)* 

                         Number of dwellings 
Monthly rental              in each class 
                             (frequency) 

Under $10.00                      327 
$10.00 to $14.99                  349 
 15.00 to  19.99                  521 
 20.00 to  29.99                1,039 
 30.00 to  49.99                1,075 
 50.00 to  74.99                  189 
 75.00 to  99.99                   24 
$100.00 and over                    9 

In this case the ranges of the two “open end” classes are 
not known. The ranges of the intermediate classes vary, 
being $5.00 for two classes, $10.00 for one class, $20.00 
for one class and $25.00 for two classes. The purposes of a 
special investigation may sometimes be served by the use 
of such a form, but a table of this type is poorly adapted to 
the requirements of statistical calculation. 
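
The two faults found in Table 10 can be checked mechanically. The following Python sketch is illustrative only: the tuple layout, the use of `None` for an undefined limit, and the helper `class_width` are assumptions of this example. It lists the open-end classes and computes the width of every class whose limits are both stated.

```python
from decimal import Decimal

# Table 10 as (lower limit, upper limit, frequency).
# None marks a limit that the table leaves undefined.
table_10 = [
    (None,     "9.99",   327),
    ("10.00",  "14.99",  349),
    ("15.00",  "19.99",  521),
    ("20.00",  "29.99", 1039),
    ("30.00",  "49.99", 1075),
    ("50.00",  "74.99",  189),
    ("75.00",  "99.99",   24),
    ("100.00", None,       9),
]

def class_width(lower, upper, unit="0.01"):
    """Width of a class whose limits are stated to the nearest `unit`."""
    return Decimal(upper) - Decimal(lower) + Decimal(unit)

open_end = [row for row in table_10 if row[0] is None or row[1] is None]
widths = [class_width(lo, hi) for lo, hi, _ in table_10
          if lo is not None and hi is not None]
uniform = len(set(widths)) == 1     # True only if every width is the same

print("open-end classes:", len(open_end))
print("widths:", [str(w) for w in widths])
print("uniform class-interval:", uniform)
```

The computed widths reproduce the $5.00, $10.00, $20.00 and $25.00 ranges named in the text.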

The Structure of Statistical Tables 

The preceding discussion has been confined to certain 
more or less technical problems which arise in the construc- 
tion of a frequency table. Nothing has been said directly 
as to the form of the completed table, the arrangement of 
columns and rows, the title, the notation. No general 
principles of tabular arrangement have been laid down. While 
no detailed treatment of these principles is possible within 
the scope of the present discussion, certain general 
considerations relating to the structure of statistical tables 
may be suggested. 

* The table is taken from Real Property Inventory, 1934. Summary 
and Sixty-Four Cities Combined, Department of Commerce, 
Washington. Figures for 255 rented dwellings in Reno were not 
reported. 

The statistical table is merely a device for presenting in 
summary fashion a mass of quantitative data. Unless the 
summary be clear, significant, concise, and readily 
interpreted nothing has been gained by the process of 
tabulation and classification. A sprawling, formless table is 
like a rambling, unintelligible discourse. There must be a 
purpose in back of each table, and this purpose should be 
clearly brought out in its arrangement. The means by which 
this purpose may be attained in a given case must be 
determined with reference to the specific conditions affecting 
that case, but standard practices should be followed, in so 
far as possible. The following general principles will be 
found helpful in deciding upon the form and arrangement of 
statistical tables: 

1. The title should constitute a clear, concise and complete 
   description of the material assembled in the table. 

2. Headings of columns and rows should be concise and 
   unambiguous. 

3. Variable quantities should increase from left to right and 
   from top to bottom, when such arrangement is feasible. 

4. Columns and rows may be numbered to facilitate reference to 
   the table. 

5. The units of measurement employed should be clearly indicated. 

6. Sources should be given in all cases. 

7. The table should constitute a unit, self-sufficient and 
   self-explanatory. All explanations necessary for its 
   interpretation should be included as integral parts of the 
   table, or in the form of footnotes. 

Graphic Representation of Frequency Distributions 

Frequency distributions of the type illustrated above serve 
a very important statistical function in presenting a compact 
summary of data, and in preparing these data for further 
manipulation. Such distributions may be presented not 
only in tabular form, but graphically, utilizing the general 
principles of the coordinate system which were explained 
above. Many of the characteristic features of a frequency 
distribution are most clearly revealed when the graphic 
method is adopted. 

Table 0, presenting the weekly earnings of 210 employees, 
with a class-interval of two dollars, is depicted graphically 
in Fig. 23. In this figure class-intervals are plotted along 
the x-axis and the corresponding class-frequencies along the 
y-axis, appropriate scales being selected. The fact should 
be noted that the scale of abscissas starts not with zero, 
but with $20. For convenience in presentation that part 
of the scale extending from 0 to $20 is omitted. The student 
should bear this in mind in seeking to secure a correct 
impression of the relations between the two variables plotted. 
In constructing such a figure, which is termed a column 
diagram or histogram, short horizontal lines are drawn 
connecting the points plotted to represent the upper and lower 
limits of each class-interval. In interpreting this diagram 
it should be noted that the areas of the different rectangles 
are proportional to the number of cases represented, the 
total area representing the entire 210 cases. This device 
thus presents to the eye a very clear picture of the 
distribution, showing quite unmistakably the relative number of 
workers falling in each of the wage classes. 

Fig. 23. Column Diagram: Distribution of 210 Employees Classified 
on the Basis of Weekly Earnings (Class-interval = $2.00) 

The classes in this case are so large, however, that some 
violence is done to the facts. So many details are lost 
that a true conception of the disposition of the items is not 
given. Fig. 24 is a histogram depicting the distribution of 
cases when a class-interval of one dollar is used. In this 
case, with smaller steps, we approach more closely an orderly 
and symmetrical distribution. The same is true of Fig. 25 
which shows the distribution when the class-interval is 
fifty cents. The distribution represented in Fig. 26 has a 
class-interval of twenty-five cents which, as has been pointed 
out, is too narrow for the data, with the result that a quite 
irregular structure is secured. (It should be noted that 
the vertical scale is not the same in these four figures, so 
that comparison with respect to class-frequencies is only 
possible by reference to the scale figures.) 

Fig. 24. Column Diagram: Distribution of 210 Employees Classified 
on the Basis of Weekly Earnings (Class-interval = $1.00) 
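
The effect of the class-interval upon the tally may be sketched in Python. The earnings listed below are hypothetical and serve only to show the mechanics; they are not the 210 cases of the text. The function `tally` and its parameters are likewise assumptions of this example. Amounts are held in whole cents, and each class runs from its lower limit up to, but not including, the lower limit of the next class, so that no case can fall in two classes.

```python
from collections import Counter

def tally(values_cents, start_cents, width_cents):
    """Count the observations falling in each class of equal width.

    Class k covers start + k*width up to, but not including,
    start + (k+1)*width. All values are assumed to be >= start.
    """
    counts = Counter((v - start_cents) // width_cents for v in values_cents)
    n_classes = max(counts) + 1
    return [counts.get(k, 0) for k in range(n_classes)]

# Hypothetical weekly earnings, in cents
earnings = [2050, 2275, 2410, 2490, 2560, 2610, 2640, 2700, 2725,
            2760, 2790, 2810, 2840, 2900, 2950, 3010, 3120]

wide = tally(earnings, 2000, 200)     # class-interval of $2.00
narrow = tally(earnings, 2000, 100)   # class-interval of $1.00

print(wide)
print(narrow)
```

With so few cases the narrower interval already leaves empty classes, which is the irregularity the text describes for an interval too small for the data.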




Fig. 25. Column Diagram: Distribution of 210 Employees Classified 
on the Basis of Weekly Earnings (Class-interval = $.50) 

Fig. 26. Column Diagram: Distribution of 210 Employees Classified 
on the Basis of Weekly Earnings (Class-interval = $.25) 

Fig. 27. Frequency Polygon: Distribution of 210 Employees Classified 
on the Basis of Weekly Earnings (Class-interval = $2.00) 

Fig. 28. Frequency Polygon: Distribution of 210 Employees Classified 
on the Basis of Weekly Earnings (Class-interval = $1.00) 
Frequency polygons corresponding to the histograms of 
Figs. 23, 24, and 25 are shown in Figs. 27, 28, and 29. 
Each of these polygons has been constructed by plotting as 
abscissas the mid-points of the class-intervals, and as 
ordinates the class-frequencies, the points thus secured being 
connected by a broken line. In completing such a figure 
the class next below the lowest one on the scale and the 
class next above the highest one on the scale are included, 
the class-frequency being zero in each case. The ends of 
the polygon thus connect with the base line at the mid-points 
of these two extra classes. In the case of the frequency 
polygon the entire area under the curve represents 
the entire number of cases, but the area of a given interval 
cannot be taken to be proportional to the number of cases 
in that interval, because of irregularities in the distribution 
on either side of the given class. The heights of the 
ordinates at the mid-points of the various classes are, of 
course, scaled to represent the class-frequencies. 

Fig. 29. Frequency Polygon: Distribution of 210 Employees Classified 
on the Basis of Weekly Earnings (Class-interval = $.50) 
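
The construction of the polygon, and the statement that its whole area represents the whole number of cases, may be checked with a small Python sketch. The frequencies are those of the five-class illustration given earlier (classes of width 10 beginning at 0); the function names are assumptions of this example.

```python
def polygon_vertices(lower_limit, width, frequencies):
    """Vertices (mid-point, frequency) of a frequency polygon.

    A zero-frequency class is added below the lowest class and above
    the highest, so that the polygon closes on the base line.
    """
    padded = [0] + list(frequencies) + [0]
    first_mid = lower_limit - width / 2    # mid-point of the extra lower class
    return [(first_mid + k * width, f) for k, f in enumerate(padded)]

def area_under(vertices):
    """Area under the broken line, obtained by summing trapezoids."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(vertices, vertices[1:]))

freqs = [3, 8, 16, 6, 2]
verts = polygon_vertices(0, 10, freqs)

print(verts)
print(area_under(verts), 10 * sum(freqs))   # polygon area, histogram area
```

The two areas agree in total, although the area standing over any single class-interval generally differs from that of the corresponding rectangle.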

The Smoothing of Curves 

Attention is again called to the results secured with 
varying class-intervals. As the class-interval is decreased, 
up to a certain point, the histograms and polygons become 
smoother and more regular. Beyond that point breaks 
begin to appear in the data; the regular change in class- 
frequencies which was found when the classes were larger 
is broken by the appearance of irregular classes which 
seem to depart from the general rule. In Fig. 25 these 
have become quite pronounced. Such irregularities, it is 
obvious, are exceptions to a general rule which seems to 
prevail, the general rule that the numbers of workers falling 
within the different wage classes increase from the lower 
limit of earnings up to a maximum in the neighborhood of 
$27.50, and then decrease till, at the upper limit of $32, 
but one worker is found. Since all the 210 individuals are 
engaged in the same work, and since their earnings depend 
only upon their rapidity and skill, one would expect a quite 
regular increase and decrease. If we had figures not for 
one week only, but for 52 weeks, and took the average 
weekly earnings of each of the 210 workers for the year, 
we should expect greater regularity with the smaller class- 
intervals than is actually found, since the accidental fluc- 
tuations peculiar to one week alone would thus be elimi- 
nated. Or, if we had earnings during one week for 10,920 
workers (52 times 210), the same result would be secured. 
Thus, if regularity and smoothness are to be secured, it is 
essential not only to decrease the size of the classes but 
also to increase the number of cases, in order that the 
accidental irregularities which affect a small number of 
observations may be eliminated. A refined classification 
with a small number of cases leads to the condition exempli- 
fied in Fig. 26. But such an increase in the number of cases 
is, in general, a practical impossibility. We wish, if pos- 
sible, to develop a feasible method of approximating the 
distribution which would be secured with very small class- 
intervals and a very large number of cases. Such an approxi- 
mation is possible through the device of curve-smoothing. 
By this method we may secure a smooth frequency curve 
which lacks the irregularities occasioned by minor fluctu- 
ations. 

Such a smooth frequency curve serves to represent the 
true underlying distribution of the data. It was pointed 
out that areas in the frequency polygon are not propor- 
tional to the number of cases included, the cause lying in 
the irregularities of the data. In a smoothed frequency 
curve these irregularities have been eliminated, and the 
area between ordinates erected at given points on the scale 
of abscissas is assumed to be proportional to the theoretical 
frequency of cases between the given values. Moreover, a 
smooth trend having been established, frequencies for in- 
termediate values not shown in the original table may be 
determined by interpolation.* 

The following data,† representing the distribution in 1918 
of personal incomes below $4,000, will serve to exemplify 
the smoothing process. 

* The limitations of practical statistical work are such that there must of 
necessity be many gaps in the data. The given values of the variables are not 
continuous. Interpolation is the process of estimating values of a variable 
quantity between given values, or of locating a point on a curve between given 
points. That interpolation is most accurate which leads to estimated values 
having the highest degree of consistency with the given values. 

† From Vol. I, Income in the United States, National Bureau of Economic 
Research. New York, Harcourt, Brace & Co., 1921, 132-33. 
Table 11 

Distribution of Income among Personal Income Recipients in 1918 
(Including all personal incomes below $4,000) 

Income class ¹           Number of persons ² 

$    0 to $  100               62,809 
   100 to    200              103,704 
   200 to    300              209,087 
   300 to    400              489,963 
   400 to    500              961,991 
   500 to    600            1,549,974 
   600 to    700            2,154,474 
   700 to    800            2,668,466 
   800 to    900            3,013,034 
   900 to  1,000            3,144,722 
 1,000 to  1,100            3,074,351 
 1,100 to  1,200            2,850,526 
 1,200 to  1,300            2,535,285 
 1,300 to  1,400            2,205,728 
 1,400 to  1,500            1,832,230 
 1,500 to  1,600            1,512,649 
 1,600 to  1,700            1,234,397 
 1,700 to  1,800              999,996 
 1,800 to  1,900              811,236 
 1,900 to  2,000              663,789 
 2,000 to  2,100              549,787 
 2,100 to  2,200              463,222 
 2,200 to  2,300              395,115 
 2,300 to  2,400              340,141 
 2,400 to  2,500              295,490 
 2,500 to  2,600              258,650 
 2,600 to  2,700              227,731 
 2,700 to  2,800              201,488 
 2,800 to  2,900              178,901 
 2,900 to  3,000              154,499 
 3,000 to  3,100              142,802 
 3,100 to  3,200              128,217 
 3,200 to  3,300              115,583 
 3,300 to  3,400              104,504 
 3,400 to  3,500               94,803 
 3,500 to  3,600               86,405 
 3,600 to  3,700               79,023 
 3,700 to  3,800               72,562 
 3,800 to  3,900               66,900 
 3,900 to  4,000               61,894 

¹ The definition of classes used is equivalent to “$0 to and not including 
$100,” etc. Thus an individual with an income of $100 would fall in the second 
class. 

² The National Bureau’s report states “The numbers below are given to the 
nearest unit. It is not pretended that such arithmetic accuracy is anything 
more than technical.” 

Figures 30, 31, and 32 present column diagrams of these 
income data, grouped with class-intervals of $500, $200, and 
$100. As the class-interval is decreased the histograms 
become more regular and uniform, but our original data 
permit us to carry this process only to the point where the 
class-interval is $100. Our problem is to determine the 
underlying distribution which the data approximate more 
and more closely as the class-interval is lessened. If we 
replace the broken line of the histogram by a smooth curve 
enclosing the same total area as the histogram and so drawn 
through the points of the histogram that the area cut from 
each rectangle is approximately equal to the area added to 
the same rectangle by the curve, we will have a frequency 
curve representing the desired distribution. The requirement 
that the same total area be enclosed is fundamental. 
Exceptions to the rule concerning the area of individual 
rectangles will frequently occur because of the existence of 
quite irregular classes, but as a general working principle 
it is helpful. (More refined methods of fitting a smooth 
curve to data will be discussed at a later point, but a process 
of smoothing by inspection such as that described above 
gives a fairly close approximation to the required curve.) 

Fig. 30. Column Diagram: Distribution of Personal Income Recipients 
in the United States, 1918. Including all Recipients of Incomes below 
$4,000 (Class-interval = $500) 

Fig. 31. Column Diagram: Distribution of Personal Income Recipients 
in the United States, 1918. Including all Recipients of Incomes below 
$4,000 (Class-interval = $200) 

Fig. 32. Column Diagram: Distribution of Personal Income Recipients 
in the United States, 1918. Including all Recipients of Incomes below 
$4,000 (Class-interval = $100) 

Fig. 33. Frequency Curve: Distribution of Personal Income Recipients 
in the United States, 1918. Including all Recipients of Incomes below 
$4,000 (Derived from the column diagram with class-interval of $100) 
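
The regrouping of the $100 classes of Table 11 into the wider classes used for Figs. 30 and 31 may be sketched in Python. The frequencies are entered as printed in the table; the function `regroup` is an assumption of this example. The check that every grouping keeps the same total corresponds to the requirement that the same total area be enclosed.

```python
# Table 11: number of persons in the forty $100 classes, $0 up to $4,000
persons = [
      62809,  103704,  209087,  489963,  961991,
    1549974, 2154474, 2668466, 3013034, 3144722,
    3074351, 2850526, 2535285, 2205728, 1832230,
    1512649, 1234397,  999996,  811236,  663789,
     549787,  463222,  395115,  340141,  295490,
     258650,  227731,  201488,  178901,  154499,
     142802,  128217,  115583,  104504,   94803,
      86405,   79023,   72562,   66900,   61894,
]

def regroup(frequencies, k):
    """Merge each run of k adjacent classes into one wider class."""
    return [sum(frequencies[i:i + k]) for i in range(0, len(frequencies), k)]

by_200 = regroup(persons, 2)   # class-interval of $200, as in Fig. 31
by_500 = regroup(persons, 5)   # class-interval of $500, as in Fig. 30

print(len(by_200), len(by_500))
print(by_500[0])               # persons with incomes from $0 up to $500
print(sum(persons) == sum(by_200) == sum(by_500))
```

The reverse step, from wide classes to narrow ones, cannot be carried out by arithmetic on the table alone; that is the gap which the smoothed curve is intended to fill.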

Figure 33 illustrates the result of smoothing the histo- 
gram of income distribution shown in Fig. 32. Here the 
quite artificial jumps between income classes are smoothed 
out, and we secure the graduation by infinitesimal increments 
which we should expect to find when the incomes of so many 
millions of persons are included. Here we have that which we 
desired: an approximation to the true underlying 
distribution, with the sharp breaks resulting from 
the method of classification eliminated. 

CONTINUOUS AND DISCRETE SERIES 

The logical validity of the smoothing process is dependent 
upon the nature of the data being manipulated. From 
this point of view frequency series of the type discussed 
above may be divided into two classes, continuous series 
and non-continuous or discrete series. A continuous series is 
one in which the values of the independent variable in- 
crease or decrease by increments which are infinitely small. 
A discrete series is one in which the phenomena represented 
by the independent variable always change in value by 
definite amounts. The curve of underlying values rises not 
smoothly, as for the continuous series, but by jumps. 

The fact should be emphasized that in making this dis- 
tinction we are speaking of the values as they would be found 
in the underlying universe of phenomena from which the ac- 
tual bodies of material we study are drawn. Any given 
sample, whether representing continuous or discrete series, 
will be marked by breaks in the values of the independent 
variable. This will be true, in the case of a continuous 
series, because of the limitations upon the instruments and 
senses we use in measuring. Thus if the heights of 100 men 
be measured, the independent variable of the frequency 
series (height) will increase by finite amounts. We may 
measure to the nearest inch, or perhaps to the nearest 
eighth of an inch. Yet if ten thousand or ten million men 
were arranged in order of height the differences between 
successive individuals would be infinitely small. Height is a 
continuous variable, even though the values found in a given 
sample are marked by discontinuity. 

Quite different is the distribution of such a variable as 
interest or discount rates. If one were to secure 100 such 
quotations and rank them in the order of size the variations 
would be discontinuous, as in the sample of men whose heights 
were measured. But in the case of heights the underlying 
values, if they could be determined for a large population, 
would be marked by continuous variation, whereas, were an 
infinite number of discount rate quotations secured, there 
would still be breaks in the sequence. Discount rates increase 
or decrease by one quarter 
or one half of one per cent, not by infinitesimal amounts. 
Such a series is termed discrete, or non-continuous. 

The smoothing process provides a means of securing an 
approximation to the distribution of values as they would 
be found if a sample could be increased indefinitely in size. 
It is based upon the assumption that the irregularities 
found in the sample actually studied are accidental, and 
that the underlying values would show continuous and un- 
broken variation. Obviously, therefore, it is only fully 
justified when applied to a continuous series. A histogram 
of human heights may be smoothed in order to secure a 
representation of the true underlying distribution in the pop- 
ulation at large, and interpolation based upon this smoothing 
process is valid. But smoothing is quite illogical for a 
markedly discontinuous series. It would be meaningless to 
construct a smooth curve showing the distribution of discount 
rates for the purpose of securing the theoretical frequency 
of a rate of 4.3675 per cent. In practical statistical 
work, however, it is frequently helpful to handle discrete 
series as though they were continuous, and in these cases 
the smoothing device may be employed. But in the inter- 
pretation and use of the smoothed curve the important 
logical distinction between continuous and discontinuous 
variation should be kept clearly in mind. 

Cumulative Arrangement of Statistical Data 

For certain purposes it is desirable to arrange data cumu- 
latively, rather than in separate and exclusive classes of 
the type illustrated in the frequency tables presented above. 
The following material will illustrate some of the advantages 
of this arrangement. 

In a study of the durability of telephone poles these 
results were secured: 

Table 12 

Frequency Distribution of 248,707 Telephone Poles, Classified 
According to Length of Life 

Length of life        Number of poles 
   (years)              (frequency) 

   0- 0.9                  1,150 
   1- 1.9                  4,221 
   2- 2.9                 10,692 
   3- 3.9                 13,966 
   4- 4.9                 16,633 
   5- 5.9                 18,211 
   6- 6.9                 19,011 
   7- 7.9                 19,260 
   8- 8.9                 20,909 
   9- 9.9                 19,879 
  10-10.9                 20,764 
  11-11.9                 15,454 
  12-12.9                 14,237 
  13-13.9                 13,779 
  14-14.9                  9,764 
  15-15.9                  8,534 
  16-16.9                  7,659 
  17-17.9                  6,918 
  18-18.9                  4,591 
  19-19.9                  1,798 
  20-20.9                    815 
  21-21.9                    313 
  22-22.9                    102 
  23-23.9                     47 

The table shows that 1,150 poles were scrapped during 
the first year of use, that 4,221 were scrapped after reaching 
the age of one year and before reaching the age of two 
years, and so on. This is simply a frequency table of the 
ordinary type. A much more significant arrangement for 
many purposes is secured when the figures are assembled 
cumulatively, as in the following table. 

* “Replacement Insurance,” Edwin Kurtz. Administration, July, 1921, 
41-69. 

Table 13 

Cumulative Distribution of 248,707 Telephone Poles, Classified 
According to Length of Life 
(Cumulated upward) 

Length of life          Number of poles surviving 
                              (frequency) 

Less than  1 year                 1,150 
  "    "   2 years                5,371 
  "    "   3   "                 16,063 
  "    "   4   "                 30,029 
  "    "   5   "                 46,662 
  "    "   6   "                 64,873 
  "    "   7   "                 83,884 
  "    "   8   "                103,144 
  "    "   9   "                124,053 
  "    "  10   "                143,932 
  "    "  11   "                164,696 
  "    "  12   "                180,150 
  "    "  13   "                194,387 
  "    "  14   "                208,166 
  "    "  15   "                217,930 
  "    "  16   "                226,464 
  "    "  17   "                234,123 
  "    "  18   "                241,041 
  "    "  19   "                245,632 
  "    "  20   "                247,430 
  "    "  21   "                248,245 
  "    "  22   "                248,558 
  "    "  23   "                248,660 
  "    "  24   "                248,707 


It is important to note that it is possible to cumulate a 
frequency series in two different ways. From the above 
table we may determine readily the number failing to at- 
tain any given age. It is often more convenient to reverse 
the process, so that the table will enable the total number 
above any given value to be immediately determined. When 
the telephone pole figures are thus cumulated downward 
the following table is secured. 

Table 14 

Cumulative Distribution of 248,707 Telephone Poles, Classified 
According to Length of Life 
(Cumulated downward) 

       (1)                       (2)                  (3) 
  Length of life      Number of poles surviving    Per cent 
                             (frequency) 

 0         and more           248,707               100.0 
 1 year     "   "             247,557                99.5 
 2 years    "   "             243,336                97.8 
 3   "      "   "             232,644                93.6 
 4   "      "   "             218,678                88.0 
 5   "      "   "             202,045                81.2 
 6   "      "   "             183,834                73.8 
 7   "      "   "             164,823                66.3 
 8   "      "   "             145,563                58.5 
 9   "      "   "             124,654                50.1 
10   "      "   "             104,775                42.1 
11   "      "   "              84,011                33.8 
12   "      "   "              68,557                27.6 
13   "      "   "              54,320                21.8 
14   "      "   "              40,541                16.3 
15   "      "   "              30,777                12.4 
16   "      "   "              22,243                 8.9 
17   "      "   "              14,584                 5.9 
18   "      "   "               7,666                 3.1 
19   "      "   "               3,075                 1.2 
20   "      "   "               1,277                 0.5 
21   "      "   "                 462                 0.2 
22   "      "   "                 149                 0.06 
23   "      "   "                  47                 0.02 
24   "      "   "                   0                 0.00 


Cumulative tables such as those given above have dis- 
tinct advantages in the handling of many types of data. Life 
tables are generally presented in this form. The scientific 
study of depreciation will lead to the construction of elab- 
orate “mortality tables” for various types of equipment, 
and these will be most useful in the cumulative form. It 
is frequently desirable to reduce the frequencies to 
percentages, as in column (3) of Table 14, though it should 
not be forgotten that the significance of the percentages 
depends upon the absolute numbers upon which they are 
based. 
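
The two cumulations may be reproduced from the frequencies of Table 12 with a few lines of Python. This is a sketch only; the variable names are assumptions of this example. The percentages are computed directly from the cumulated figures and rounded to one decimal, so they need not agree with column (3) of Table 14 in the last digit.

```python
from itertools import accumulate

# Table 12: poles scrapped at ages 0-0.9, 1-1.9, ..., 23-23.9 years
scrapped = [1150, 4221, 10692, 13966, 16633, 18211, 19011, 19260,
            20909, 19879, 20764, 15454, 14237, 13779, 9764, 8534,
            7659, 6918, 4591, 1798, 815, 313, 102, 47]

total = sum(scrapped)

# Cumulated upward (Table 13): surviving less than 1, 2, ..., 24 years
upward = list(accumulate(scrapped))

# Cumulated downward (Table 14): surviving 0, 1, ..., 24 years and more
downward = [total] + [total - c for c in upward]

# Column (3): each downward figure as a percentage of all poles
percent = [round(100 * n / total, 1) for n in downward]

for years, (n, p) in enumerate(zip(downward, percent)):
    print(f"{years:2d} years and more  {n:7d}  {p:5.1f}")
```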

THE OGIVE, OR CUMULATIVE FREQUENCY CURVE 

The general utility of such cumulated data is limited by 
the classification system necessarily adopted in condensing 
material. Unless we interpolate mathematically we are 
limited to the points on the scale actually noted in the two 
tables. For this reason, a generalized cumulative curve 
similar to the smoothed frequency curve described in the 
preceding section is desirable. If the values given in Table 13 
be plotted on coordinate paper (the length of life in each 
case as abscissa, and the corresponding number of poles as 
ordinate) and a smooth curve drawn through the points 
thus plotted, the cumulative frequency curve shown in 
Fig. 34 is secured. In Fig. 35 the data of Table 14 are 
plotted. 

Fig. 34. Cumulative Frequency Curve: Distribution of Telephone Poles 
Classified according to Length of Life (Cumulated upward) 

Such a curve constitutes one of the most effective and 
useful representations of a frequency series. It is obvious 
that the limitations of the particular class-interval adopted 
are in large part removed; the shape of the curve will be 
fundamentally the same, though the class-interval and number 
of classes may vary. Frequency curves of the usual 
type may not be compared unless the groupings are the 
same, but cumulative frequency curves are subject to no 
such restriction. Moreover, uneven class-intervals do not 
distort the ogive, or cumulative curve, as they do the ordinary 
frequency curve. 

Fig. 35. Cumulative Frequency Curve: Distribution of Telephone Poles 
Classified according to Length of Life (Cumulated downward) 

The cumulative curve is particularly well adapted to 
interpolation. Thus if it is desired to know the number of 
poles surviving less than 15½ years, the value of the ordinate 
of the curve having 15½ as abscissa may be approximated 
from Fig. 34. A value of 222,000 is secured. If the 
number surviving 8½ years or more is desired, a similar 
estimate may be made from Fig. 35. The interpolated 
figure in this case is 135,000. 

Another type of interpolation possible with such a curve 
is the determination of the number of cases falling within 
any given interval. One is not limited to the class-intervals 
marked out in the original tables. For instance, it may be 
desirable to know the number of poles surviving more than 
10½ but less than 15 years. Reading from the table or from 
the chart we find that 217,930 poles survived less than 
15 years. Interpolating on the chart in the manner described 
above a figure of 154,000 is secured for the number 
surviving less than 10½ years. Subtracting the latter figure 
from the former we have 63,930 as the number of poles 
falling within the 10½ to 15 years interval. The figure is, 
of course, an approximation to the true value, as are all 
values secured through such smoothing and interpolation. 
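
The readings described above are taken by eye from a smooth curve. As a rough numerical stand-in, the following Python sketch interpolates along straight lines between the plotted points of the upward cumulation. This is a substitute technique, not the graphic method of the text, and the function name is an assumption of this example. Its results fall close to the rounded figures read from the charts.

```python
from itertools import accumulate

# Table 12: poles scrapped at ages 0-0.9, 1-1.9, ..., 23-23.9 years
scrapped = [1150, 4221, 10692, 13966, 16633, 18211, 19011, 19260,
            20909, 19879, 20764, 15454, 14237, 13779, 9764, 8534,
            7659, 6918, 4591, 1798, 815, 313, 102, 47]

# upward[t] = number of poles surviving less than t whole years
upward = [0] + list(accumulate(scrapped))

def surviving_less_than(years):
    """Straight-line interpolation between the two bracketing points.

    `years` is assumed to be zero or greater.
    """
    t = int(years)
    if t >= len(upward) - 1:            # at or beyond the last plotted point
        return float(upward[-1])
    frac = years - t
    return upward[t] + frac * (upward[t + 1] - upward[t])

less_15_5 = surviving_less_than(15.5)
less_10_5 = surviving_less_than(10.5)
in_interval = surviving_less_than(15) - less_10_5

print(less_15_5)     # compare the 222,000 read from Fig. 34
print(less_10_5)     # compare the 154,000 read from the chart
print(in_interval)   # compare the 63,930 obtained in the text
```

The number surviving a given age or more follows by subtracting the interpolated value from the total of 248,707.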

It should be noted that the ogive may be derived directly 
from the array, without the formation of a frequency table 
as an intermediate step. This curve, in fact, may be looked 
upon as merely a graphic representation of the array. It 
represents one of the simplest forms of statistical organi- 
zation, as well as one of the most effective methods of 
manipulating quantitative data. 


RELATION BETWEEN THE OGIVE AND THE FREQUENCY CURVE 

The ogive and the frequency curve are merely two dif- 
ferent arrangements of precisely the same material, each 
arrangement having certain distinctive advantages. The 
characteristics of each may be more clearly apparent if the 
structural relationship between these two curves is understood. 
This relationship is graphically portrayed in Fig. 36.* 

* The suggestive arrangement shown in this figure was originated by Robert 
E. Chaddock. 
Fig. 36. Distribution of Sawmills in the United States Classified 
according to Labor Cost in 1921. Illustrating the Structural Relation 
between the Ogive and the Frequency Curve. (Horizontal scale: labor 
cost per 1,000 feet.) 

This figure is based upon the following frequency table, 
showing the distribution of sawmills in the United States, 
classified on the basis of labor cost per 1,000 feet of lumber 
produced.† 

Table 15 

Frequency Distribution of 269 Sawmills in the United States Classified 
According to Labor Cost in 1921 

Labor cost (all employees) per      Number of establishments 
  1,000 feet, board measure               (frequency) 

        $1.00-$1.49                            3 
         1.50- 1.99                           10 
         2.00- 2.49                           14 
         2.50- 2.99                           22 
         3.00- 3.49                           38 
         3.50- 3.99                           40 
         4.00- 4.49                           38 
         4.50- 4.99                           33 
         5.00- 5.49                           20 
         5.50- 5.99                           11 
         6.00- 6.49                           10 
         6.50- 6.99                           11 
         7.00- 7.49                            8 
         7.50- 7.99                            4 
         8.00- 8.49                            4 
         8.50- 8.99                            3 

                                             269 

The upper part of Fig. 36 indicates the method by which 
the ogive is built up. Just as in the histogram, the area of 
each rectangle is proportional to the number of cases falling 
in the given class. Since the operation is a cumulative 
one, however, the base of each rectangle is set at the cumulated 
frequency of all preceding classes. Thus the y-value (frequency) 
of the first rectangle is 3, erected from zero as a 
base, the y-value of the second class is 10, erected from 3 
as a base, and so on. The slope of the curve connecting 
these rectangles is gradual at first when the frequencies 


* From “Labor Efficiency and Productiveness in Sawmills,” Ethelbert 
Stewart, Monthly Labor Review, January, 1923, 14. Seven scattered cases 
above $9.00 in value have been omitted from the table and the accompanying 
graph. 
are low, then steeper as the frequencies become greater, 
and finally tapers off as the frequencies decrease near the 
upper limit of the distribution. This is the cumulative 
frequency curve, or ogive. 

When the various rectangles representing the class-fre- 
quencies are dropped to the zero line as a common base, 
the X- values remaining the same throughout, the histogram 
or column diagram described in an earlier section is secured. 
From this the frequency polygon or smoothed frequency 
curve may be derived. 
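
The relation just described can be checked numerically. The following is a minimal sketch in Python (the variable names are illustrative and are not part of the text), using the class frequencies of Table 15. Each ogive rectangle is erected on the cumulated frequency of the preceding classes; dropping every rectangle to a zero base gives back the heights of the histogram.

```python
from itertools import accumulate

# Table 15: lower class limits (dollars per 1,000 feet) and class frequencies.
lower_limits = [1.00 + 0.50 * i for i in range(16)]
frequencies = [3, 10, 14, 22, 38, 40, 38, 33, 20, 11, 10, 11, 8, 4, 4, 3]

# Ogive: the top of each rectangle is the cumulated frequency through that
# class; its base is the cumulated frequency of all preceding classes.
tops = list(accumulate(frequencies))
bases = [0] + tops[:-1]

for lower, f, base, top in zip(lower_limits, frequencies, bases, tops):
    print(f"${lower:4.2f}  f = {f:3d}  ogive rectangle from {base:3d} to {top:3d}")

# Histogram: the same rectangles dropped to the zero line as a common base.
histogram_heights = [top - base for base, top in zip(bases, tops)]
```

The last cumulated value equals the total number of establishments, and the histogram heights are simply the original class frequencies, which is the sense in which the two curves are arrangements of the same material.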
CHAPTER IV 


DESCRIPTION OF THE FREQUENCY 
DISTRIBUTION: AVERAGES 

The classification of quantitative data and the construc- 
tion of a frequency distribution constitute an important 
stage in the task of organization and analysis. By means 
of classification the underlying structure of the data may 
be revealed and the essential unity of a mass of material 
may be brought out. But this is only the first step in statis- 
tical analysis. It remains to develop methods of measuring 
and expressing more concisely the significant characteristics 
of a body of data. For certain purposes the frequency dis- 
tribution itself must be summarized and condensed, must be 
boiled down until its essence has been distilled into three or 
four significant figures. 

If each frequency distribution constituted a novel and 
unique problem, obeying a law peculiar to itself, the task 
of studying and describing such distributions would be a 
difficult one. Fortunately this is not so. Quantitative 
data in widely different fields, when assembled in frequency 
distributions, show certain common characteristics, obey 
certain general laws. Experience in one field, therefore, 
constitutes a guide to work in others. Uniformity in the 
behavior of masses of data makes possible the development of 
a generalized method of organizing, analyzing and comparing 
measurements drawn from many fields of scientific study. 

Comparison of Frequency Distributions 

This fact of a common law of arrangement running through 
the universe of quantitative facts may be brought home 
most effectively by a comparison of distributions illustrative 
of various types of data. The characteristics of the frequency 
distributions and of the frequency curves which follow should 
be noted, and the distributions compared. 



Height in Inches 


Fig. 37. — Frequency Curve: Distribution of 18,780 Soldiers Classified 
according to Height 

The curve in Fig. 37 is based upon the following data 
relating to the heights of 18,780 soldiers.^ 


Table 16 

Distribution of Soldiers Classified According to Height 

Height in inches      Number of soldiers 

    60 +                     197 
    61 +                     317 
    62 +                     692 
    63 +                   1,289 
    64 +                   1,961 
    65 +                   2,613 
    66 +                   2,974 
    67 +                   3,017 
    68 +                   2,287 
    69 +                   1,599 
    70 +                     878 
    71 +                     520 
    72 +                     262 
    73 +                     174 

    Total                 18,780 

* From G. C. Whipple, Vital Statistics, New York, Wiley, 1919, 377. 
Fig. 38 depicts a frequency curve based upon 1,000 observations, 
made at Greenwich, of the Right Ascension of 
Polaris.* The values on the abscissa define deviations, in 
seconds of time, from an origin near the mean of all the 
observations. Frequencies of occurrence of given values on 
the x-scale are measured, of course, as ordinates on the 
y-scale. The distribution plotted in Fig. 38 is given in 
Table 17. 

Fig. 38. — Frequency Curve: Distribution of Errors of Observation in 
Astronomical Measurements 
(Horizontal axis: magnitude of deviation in seconds of time, from -3.5 to +3.5.) 

* E. T. Whittaker and G. Robinson, The Calculus of Observations, 
London, Blackie and Son, 1924, 174. 

Table 17 

Distribution of Errors of Observation in Astronomical Measurements 
(1,000 observations of the Right Ascension of Polaris) 

Magnitude of deviation,               Number of observations 
in seconds of time, from origin 

        -3.5                                    2 
        -3.0                                   12 
        -2.5                                   25 
        -2.0                                   43 
        -1.5                                   74 
        -1.0                                  126 
        -0.5                                  150 
         0                                    168 
         0.5                                  148 
         1.0                                  129 
         1.5                                   78 
         2.0                                   33 
         2.5                                   10 
         3.0                                    2 

        Total                               1,000 

If a piece of artillery be accurately adjusted on a given 
target (a point) and 100 shots be fired, it will be found 
that the points of impact of the hundred shots will be dispersed 
about the target. No matter how accurate the piece 
or the adjustment, only a small percentage of the shots 
will fall upon the exact point at which they were directed. 
The points of impact will be scattered about the target 
in a quite regular fashion, however. If a rectangle be so 
drawn as to include all the points of impact, and this rectangle 
(or zone of dispersion) be divided into eight equal 
parts, the distribution of shots within these sections will be 
as indicated in Fig. 39. (In any given case there are likely 
to be slight departures from this order, but in the long run 
this distribution will prevail.) 

    |  2 |  7 | 16 | 25 | 25 | 16 |  7 |  2 | 

Fig. 39. — Zone of Dispersion, Artillery Firing, Showing the Theoretical 
Percentage Distribution of Shots 

This general rule holds for all classes of guns. The more 
accurate the gun the smaller will be the zone of dispersion, 
but the distribution within this zone is theoretically the 
same in all cases. Rules of fire used in artillery adjustment 
are based upon this fact. 
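
The theoretical percentages of Fig. 39 can be reproduced under two assumptions that the text does not state and that are made here only for illustration: that the errors follow the normal curve of error, and that each of the eight parts of the zone is one "probable error" wide (the probable error being the deviation that half the shots fall within). A minimal Python sketch:

```python
from statistics import NormalDist

# Assumption (not stated in the text): errors are normally distributed and
# each of the eight sections is one probable error wide.
normal = NormalDist()                     # mean 0, standard deviation 1
probable_error = normal.inv_cdf(0.75)     # half of all shots fall within +/- this

# Nine section boundaries, from -4 to +4 probable errors.
edges = [k * probable_error for k in range(-4, 5)]

# Percentage of shots expected in each of the eight sections.
percentages = [
    100 * (normal.cdf(upper) - normal.cdf(lower))
    for lower, upper in zip(edges, edges[1:])
]
rounded = [round(p) for p in percentages]
print(rounded)
```

Under these assumptions a small fraction of one per cent of the shots falls outside the eight sections, so the rounded figures describe the zone only approximately.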
The results of actual firing may be contrasted with this 
theoretical distribution. Table 18 presents a record of one 
thousand shots fired from a battery gun at the middle of a 
stationary target two hundred yards distant.^ The target 
was divided by horizontal lines into eleven equal divisions. 

Table 18 

Distribution of One Thousand Shots from a Single Gun 

Division            Number of shots recorded 

 1 (top)                      1 
 2                            4 
 3                           10 
 4                           89 
 5                          190 
 6                          212 
 7                          204 
 8                          193 
 9                           79 
10                           16 
11 (bottom)                   2 

Total                     1,000 


These results are presented graphically in Fig. 40. 

The zone of dispersion being divided into eleven divisions 
instead of the eight referred to in describing the theoretical 
distribution, a direct comparison cannot be made. We 
have here, however, the same general type of distribution 
found in the other examples given. A tendency toward 
concentration in the lower half of the target reflects a slight 
departure from symmetry. 

Fig. 40. — Column Diagram: Distribution of 1,000 Shots from a Single Gun 

* This experiment is recorded in the Report of the Chief of Ordnance, 1878, 
Appendix S. The results are given in The Method of Least Squares, Mansfield 
Merriman, New York, Wiley, 1897, 14. 

When coins are tossed the distribution of heads and tails 
is assumed to be determined by pure chance. In a single 
experiment ten coins were tossed 100 times. The following 
table shows the frequencies with which given numbers of 
heads appeared. (The greatest number of heads possible 
in a given throw under such conditions is, of course, 10; 
it is also possible that no heads should appear.) 

Figure 41 depicts the above frequency distribution. 
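
For comparison with such an experiment, the frequencies that pure chance would lead one to expect may be computed. The sketch below assumes fair coins and independent throws, so that the number of heads follows the binomial distribution; these are theoretical expectations computed for illustration, not the recorded results of the experiment, and the names used are illustrative only.

```python
from math import comb

COINS = 10      # coins tossed in each throw
THROWS = 100    # number of throws in the experiment

# Expected number of throws showing k heads, k = 0, 1, ..., 10,
# assuming each coin falls heads with probability one half.
expected = [THROWS * comb(COINS, k) / 2**COINS for k in range(COINS + 1)]

for heads, throws in enumerate(expected):
    print(f"{heads:2d} heads: {throws:6.2f} throws expected")
```

The expected frequencies rise to a maximum at five heads and fall away symmetrically on either side, the same general form shown by the preceding examples.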



Distribution of Economic Data 

We find in these four widely different fields something 
approaching a uniform law of arrangement of quantitative 
data. The examples which have been given, however, do 
not represent the world of economic facts. Do economic 
data show the same general characteristics? If reference 
be made to the examples given in Chapter III, comparisons 
with the four preceding illustrations may be made. The 
frequency distributions referred to are those relating to 
weekly earnings of employees, the length of life of tele- 
phone poles, the distribution of labor cost in sawmills and 
the distribution of incomes below $4,000 in the United 
States. (The curve of the latter distribution, it should be 
noted, would show a long tail extending far to the right if 
the incomes above $4,000 were included.) Several additional 
examples of economic data may be given. 

Figure 42 illustrates the order in which price variations are 
distributed. It is based upon a study made by W. C. Mitchell 
of 5,578 individual cases of change in the wholesale prices 
of commodities from one year to the next.* Thus, for example, 
the average price of middling upland cotton in 
New York in a given year was $0.115 per pound. In the 
following year the average price was $0.128 per pound, an 
increase of 11.3 per cent. This would constitute one entry 
in the table of rising prices, falling in the class “10-11.9%.” 
The entire table consists of 5,578 such entries. These data 
are presented in Fig. 42 in the form of a frequency polygon, 
no attempt being made to smooth the curve. 

Fig. 42. — Frequency Polygon: Distribution of 5,540 Cases of Change 
in the Wholesale Prices of Commodities from One Year to the Next 
(after Mitchell) 
(Horizontal axis: percentage of fall, percentage of rise.) 

* From Bulletin 284, U. S. Bureau of Labor Statistics, Part I, “The Making 
and Using of Index Numbers,” 18. The figure shows the price changes only 
within the range of a 51 per cent fall and a 51 per cent rise. One case of a 
price fall of 55 per cent is not shown, and 37 cases of price increases ranging 
from 52 per cent to 104 per cent have not been included. 
Table 20 shows the distribution of London-New York 
exchange rates (sterling exchange) from 1882 to 1913, inclusive. 
This was a period when both currencies were freely 
convertible into gold, at fixed ratios, with customary market 
forces operating to keep exchange rates between the two 
“gold points.” Observations covering recent decades would 
show quite different characteristics. In the distribution 
shown graphically in Fig. 43 monthly rates have been 
classified according to the frequency of their occurrence over 
thirty-two years of pre-war experience.* 

Fig. 43. — Frequency Polygon: Distribution of London-New York 
Exchange Rates (as recorded over a period of 384 months) 
(Horizontal axis: dollars.) 

* “The figures are . . . the averages of those quoted at the beginning of each 
month in the Economist; on and after July, 1886, the exchange is the ‘telegraphic 
transfer,’ before that date, ‘short at interest.’” The data are taken from 
An Academic Study of Some Money Market and Other Statistics, by E. G. 
Peake, London, P. S. King, 1923, Appendix I. 

Table 20 

Distribution of London-New York Exchange Rates as Recorded by 
Months during the Period 1882-1913 

Class-interval          Frequency 
                        (number of months given rate prevailed) 

$4.8275-$4.8324               1 
 4.8325- 4.8374               6 
 4.8375- 4.8424              11 
 4.8425- 4.8474              21 
 4.8475- 4.8524              23 
 4.8525- 4.8574              24 
 4.8575- 4.8624              25 
 4.8625- 4.8674              40 
 4.8675- 4.8724              45 
 4.8725- 4.8774              49 
 4.8775- 4.8824              35 
 4.8825- 4.8874              45 
 4.8875- 4.8924              33 
 4.8925- 4.8974              16 
 4.8975- 4.9024               8 
 4.9025- 4.9074               1 
 4.9075- 4.9124               1 

A fairly typical distribution of wage-earners classified 
according to the amount of their weekly earnings is shown 
in Table 21 and, graphically, in Fig. 44. The data relate to 
13,427 steel workers in open-hearth furnaces, in the United 
States, in 1935. There is a clear concentration of workers 
whose earnings fall between $16 and $24 a week. The distribution 
is markedly skewed, however, with a tail extending 
far to the right. The range of weekly earnings, like that of 
incomes in general, is far greater above the mode than 
below. 

Fig. 44. — Distribution of Wage-Earners in Open-Hearth Furnaces, 
Classified according to Average Weekly Earnings in 1935 
(Horizontal axis: class-interval, in dollars per week.) 

Table 21 

Distribution of Wage-Earners in Open-Hearth Furnaces in the United 
States, Classified According to Average Weekly Earnings in 1935 
(Total for all districts) 

Class-interval             Frequency 
(in dollars per week)      (number of workers earning stated amount) 

$  0-$  7.99                    583 
   8-  15.99                  2,200 
  16-  23.99                  4,462 
  24-  31.99                  3,032 
  32-  39.99                  1,527 
  40-  47.99                    764 
  48-  55.99                    358 
  56-  63.99                    210 
  64-  71.99                    144 
  72-  79.99                     44 
  80-  87.99                     36 
  88-  95.99                     21 
  96- 103.99                     26 
 104- 111.99                      3 
 112- 119.99                      7 
 120- 127.99                      1 
 128- 135.99                      9 

Total                        13,427 


The frequency curves and histograms based upon eco- 
nomic data, it will be noted, do not all show the symmetry 
and regularity which seem to characterize the curves rep- 
resenting physical data. Some are non-symmetrical, showing 
a preponderance of cases on one side of the point of greatest 
concentration. In some there are breaks in the regularity of 
the increase or decrease of frequencies. But in spite of 
these differences there is obviously a family resemblance 
between the measurements drawn from the fields of eco- 
nomics, astronomy, anthropometry, ballistics, and pure 
chance. Certain of the common characteristics may be noted. 

General Characteristics of Frequency Distributions 

There is, in the first place, variation in the values of the 
measurements secured. Human heights vary, astronomical 
measurements of the same quantity differ, projectiles fired 
under conditions as nearly constant as it is humanly possible 
to make them fail to land at the same spot, incomes vary 
as between individuals, and exchange rates move from week 
to week and month to month. The various observations or 
values secured in a given case are distributed along a scale, 
between two extreme values. 

The distribution of these values along the scale (the 
x-axis) is such that, moving from one extreme value to- 
wards the other, the cases found at successive points along 
the scale (the successive class frequencies) increase with 
more or less regularity up to a maximum, and then de- 
crease in much the same way. In spite of variation, there- 
fore, we find a central tendency, a massing of cases at certain 
points on the scale of values. This is the second notable 
characteristic which all the frequency distributions appear 
to possess in common. 

If we measure, for each of the successive classes, the 
amount of deviation along the scale from the point of 
greatest concentration it will be noted that small deviations 
are much more frequent than large ones, that extreme 
deviations are rare, and that deviations on both sides of 
the point of concentration reach perfect (or almost perfect) 
equality in the examples taken from the physical sciences 
and from the field of pure chance, and approximate equality 
in the economic distributions. (Exceptions to this rule 
of approximate equality on the two sides of the point of 
greatest concentration are not infrequent, the example of 
income distribution being a rather striking case in point.) 




Figure 45 depicts a curve which is termed the “proba- 
bility curve,” or the “normal curve of error.” Its charac- 
teristics will be discussed in greater detail in a later section. 
At this point it is presented merely as a basic type which 
some of the above examples approach closely, and from 
which others of the examples represent more or less pro- 
nounced deviations. Departures from this type, let it be 
emphasized, are numerous and significant, but as a basic 
form this normal curve of error is extremely important in 
statistical work. Even the most important variations from 
this type resemble it with sufficient closeness to justify the 
use of a generalized method of describing frequency distri- 
butions. Distributions of quantitative data vary, and their 
variations from each other and from certain standard types 
are of the greatest significance, but in spite of their varia- 
tions a family resemblance runs through them all. Each 
new frequency distribution is not an isolated phenomenon, 
but a member of a large family, and as such the problem 
of describing and analyzing it may be approached with 
confidence in methods which have been found applicable in 
other cases. 

Given this more or less common type, how may a given dis- 
tribution be described and differentiated from others? Certain 
methods will have been suggested by the preceding discussion. 

Methods of Describing a Frequency Distribution 

The values of all the observations, it has been noted, are 
spread along a scale. The frequency distribution may be 
described by the selection of a single value on that scale 
which is thoroughly representative of the distribution as a 
whole. Since the frequencies vary, an obvious choice is 
the selection of that value which occurs the greatest number 
of times, or, in other words, that point on the scale at 
which the concentration is greatest. This value consti- 
tutes a measure of the central tendency of the distribution. 
Thus, one might find the income class in which the greatest 
number of people fall, and let the mid-point of that class 
(which is $950 in the distribution presented in Table 11) 
serve as the representative of the distribution. This most 
common value, it should be noted, is only one of several 
possible measures of the central tendency of a given 
distribution. All such measures are termed averages. 

A single representative value such as this has many uses 
but, by itself, it obviously leaves out many facts concern- 
ing the distribution. Of great importance is the character 
of the distribution about the average. Are the values of all 
tabulated cases closely concentrated, or is there pronounced 
dispersion over a wide range? The representative character 
of any average depends upon how closely the other values 
cling to it, upon the degree of concentration about the 
central tendency. The average, therefore, must be 
supplemented by a measure of variation, a measure of the “scatter” 
about the central value. 
An adequate description should include also an account 
of the degree of symmetry of the distribution. It is highly 
important to know whether there is an equal distribution 
of cases on each side of the point of greatest concentration, 
or whether the frequency curve is skewed to one side, as in 
the case of income distribution illustrated above. If the 
curve is not symmetrical the degree of asymmetry should 
be determined, and for this purpose measures of skewness 
have been developed. 

It is, finally, possible to measure the degree of peaked- 
ness of frequency curves, by comparing them with the 
normal curve of error as a standard. It is obvious that 
the frequency polygon representing price changes (Fig. 42) 
would, if smoothed, constitute a curve much more peaked 
than the normal curve, and this fact of pronounced con- 
centration at the central value is highly significant. This 
characteristic of frequency curves is called kurtosis, and 
the measurement of kurtosis constitutes the final step in 
the description of the frequency distribution. 
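
The four descriptive measures named above may be illustrated on the distribution of Table 17. The sketch below uses moment coefficients for variation, skewness and kurtosis; this is one common modern choice, adopted here as an assumption for illustration, and it is not necessarily the set of measures developed later in this book. All names in the code are illustrative.

```python
from math import sqrt

# Table 17: deviations (seconds of time) and numbers of observations.
deviations = [-3.5 + 0.5 * i for i in range(14)]
frequencies = [2, 12, 25, 43, 74, 126, 150, 168, 148, 129, 78, 33, 10, 2]
n = sum(frequencies)

def central_moment(order, center):
    """Mean of the deviations from `center`, raised to the given power."""
    return sum(f * (x - center) ** order
               for x, f in zip(deviations, frequencies)) / n

# Central tendency: the most frequent value and the arithmetic mean.
modal_value = deviations[frequencies.index(max(frequencies))]
mean = sum(f * x for x, f in zip(deviations, frequencies)) / n

# Variation, skewness and kurtosis as moment coefficients (an assumption).
std_dev = sqrt(central_moment(2, mean))
skewness = central_moment(3, mean) / std_dev ** 3   # zero for a symmetrical curve
kurtosis = central_moment(4, mean) / std_dev ** 4   # 3 for the normal curve

print(modal_value, mean, std_dev, skewness, kurtosis)
```

On this convention a symmetrical distribution has a skewness of zero, and the normal curve of error has a kurtosis of 3, so the two coefficients measure departure from that standard.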

When these various measures have been secured the task 
of statistical analysis will be well under way. The chaotic 
assortment of data with which we started will have been 
reduced to workable form in the shape of a frequency table, 
and the essential facts which the table reveals will have 
been distilled into three or four significant measures. This 
process not only reveals the characteristics of the given 
distribution, but also facilitates comparison with similar 
distributions. For example, it is impossible to compare 
some tens of millions of unorganized personal income figures 
for the United States with similar data for Great Britain. 
But if we secure a value for the average or most 
representative income for each country, together with a 
description of the distribution of personal incomes about that 
central value, a legitimate basis for comparative study is 
obtained. In manipulating and analyzing masses of ma- 
terial, whatever the purpose of study may be, full use 
should be made of the power to condense, simplify and 
compare which is given by the measures employed in de- 
scribing the frequency distribution. 

The succeeding section is devoted to a discussion of one 
phase of this descriptive process, that concerned with the 
measurement of central tendencies. After the development 
of this subject of averages, problems relating to measures 
of variation and of skewness will be dealt with. 

Averages 

We have seen that the representation of a frequency 
distribution by an average, a single typical figure, is justi- 
fied because of the tendency of large masses of figures to 
cluster about a central value, from which the values of all 
observed cases depart with more or less regularity and 
smoothness. It is solely because of the concentration of 
cases about a central point on the scale that such repre- 
sentative figures have significance. The average represents 
the distribution as a whole only because it is a typical 
value. If the individual items entering into a distribution 
vary widely in value and show no tendency toward con- 
centration, no single value can represent them. Thus the 
arithmetic mean of the three numbers 3, 125, 1,000 is 376, 
but 376 in no way represents the three values on which it 
is based. This fundamental requirement, that there be a 
tendency toward concentration about a central value, must 
be met if an average is to be at all representative. 

If the general character of a frequency distribution be 
recalled the logic of one sort of average will be clear at 
once. It was suggested above that that point on the x-scale 
at which the concentration is greatest, that value which 
occurs the greatest number of times, might be taken as 
typical of the entire distribution. This value is termed 
the mode, and the group in which it falls is called the modal 
group. If a frequency curve be drawn to represent a given 
distribution, the mode will be the x-value corresponding to 
the maximum ordinate.* The maximum ordinate itself measures 
the frequency of the modal group. Students frequently 
confuse these two values in determining the mode. It is 
not the distance along the y-scale but the distance along 
the x-scale which measures the value of the mode. The 
ordinates merely measure the number of cases falling in the 
several classes, not the values of the cases falling in those 
classes. 
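
The distinction between the two values may be seen in a short Python sketch based on Table 21. It takes the mid-point of the modal class as a rough value of the mode; this is an assumption made for illustration, and the names are illustrative only.

```python
# Table 21: lower class limits (dollars per week, classes $8 wide)
# and the number of workers in each class.
lower_limits = list(range(0, 136, 8))
frequencies = [583, 2200, 4462, 3032, 1527, 764, 358, 210, 144,
               44, 36, 21, 26, 3, 7, 1, 9]
WIDTH = 8

# The modal group is the class with the greatest frequency.
modal_index = max(range(len(frequencies)), key=frequencies.__getitem__)

# A y-value: the maximum ordinate, a count of cases.
maximum_ordinate = frequencies[modal_index]

# An x-value: the rough mode, in dollars per week.
rough_mode = lower_limits[modal_index] + WIDTH / 2

print(f"modal group begins at ${lower_limits[modal_index]}")
print(f"maximum ordinate (number of workers): {maximum_ordinate}")
print(f"rough mode (dollars per week): {rough_mode}")
```

The maximum ordinate is a number of workers, while the mode is a sum of money; only the second is an average of the distribution.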

As typical of a given distribution we might also select 
that point on the scale of x-values on each side of which 
one half the total number of cases fall. This value, which 
is called the median, is that which exceeds the values of 
one half the cases included, and is in turn exceeded by the 
values of one half the cases. Thus it has been estimated 
that in 1918 the median value of personal incomes in the 
United States was $1,140; one half of the 37 million recipients 
of personal incomes received less than this sum, while 
one half received more. When a distribution is represented 
by a frequency curve, the area under the curve is divided 
into two equal parts by an ordinate erected at that point 
on the x-axis corresponding to the median value. This 
follows, of course, from the definition of the median, and 
from the fact that the area under a frequency curve 
represents the total number of cases included in the distribution. 
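
A minimal sketch of locating the median of a grouped distribution follows, using Table 21. It assumes that the cases within the class containing the median are spread evenly over that class, so that the point dividing the total area in half can be found by simple proportion; that assumption and the names in the code are illustrative.

```python
from itertools import accumulate

# Table 21: lower class limits (dollars per week, classes $8 wide)
# and the number of workers in each class.
lower_limits = list(range(0, 136, 8))
frequencies = [583, 2200, 4462, 3032, 1527, 764, 358, 210, 144,
               44, 36, 21, 26, 3, 7, 1, 9]
WIDTH = 8

total = sum(frequencies)
half = total / 2

# Find the first class at which the cumulated frequency reaches one half.
cumulative = list(accumulate(frequencies))
median_index = next(i for i, c in enumerate(cumulative) if c >= half)

# Cases lying below the median class, and the share of that class needed
# to bring the count up to one half of the total.
cases_below = cumulative[median_index] - frequencies[median_index]
share = (half - cases_below) / frequencies[median_index]

# Assumption: cases are evenly distributed within the median class.
median = lower_limits[median_index] + WIDTH * share
print(f"median weekly earnings: about ${median:.2f}")
```

An ordinate erected at this value divides the histogram into two parts of equal area, each standing for one half of the 13,427 workers.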

The arithmetic mean is a third type of average which 
may be used to represent a distribution. This is a calculated 
average, affected by the value of every item in the 
distribution. Herein, obviously, it differs from the mode 
and the median, which depend primarily upon the relative 
position of the items in the frequency table, and are not 
affected by the values of all individual items. The arithmetic 
mean is the center of gravity of a distribution; it 
would be the x-value of the point of balance of a frequency 
curve, if the curve could be blocked out and manipulated 
in solid form. 

* Strictly speaking, the mode is the x-value corresponding to the maximum 
ordinate of the ideal frequency curve which has been fitted to the given 
distribution. 

The geometric mean and the harmonic mean are two other 
averages the characteristics of which will be discussed at a 
later point. 

The computation or location of these various averages 
may involve somewhat lengthy processes if the number of 
cases included be great. If appropriate methods be em- 
ployed, however, the labor of computation may be materi- 
ally cut down. The use of the following symbols will sim- 
plify the explanation of these methods: 

M: Arithmetic mean. 

Mo: Mode. 

Md: Median. 

m: The value of an individual observation; in a frequency 
distribution, the value of the midpoint of a class. 

f: The number of items (observations) in a given class 
in a frequency distribution. 

N: The total number of items in a given series or frequency 
distribution. 

Σ (Sigma): The symbol for the process of summation, meaning 
“the sum of.” 

The Computation of the Arithmetic Mean 

Using the above notation, the formula for the arithmetic 
mean is 

\[ M = \frac{\Sigma m}{N} \] 

Thus the mean of the measures 2, 5, 6, 7, is equal to the 
sum of these measures divided by 4, which is 20/4 or 5. The 
computation of the arithmetic mean when each measure is 
reported at its true value is thus a simple process of summation 
and division. The weekly earnings of 210 factory 
employees were listed in an earlier section. If these figures 
be added, and the total divided by 210, the mean weekly 
wage is found to be $26.983. In this case the task of adding 
210 items is somewhat tedious; it is a task which would 
become almost impossible if one were dealing with the 
37 million personal income figures, for example. For practical 
reasons, therefore, it is usually necessary to compute 
the required averages from the frequency distribution rather 
than from the original ungrouped data. To exemplify this 
process we may utilize data relating to the weekly earnings 
of steel workers in the Pittsburgh District in 1935. 

The importance of certain of the precautions mentioned in 
the section on classification, in connection with the choice 
of a class-interval, will be clear from this example. When 
the mean of a distribution is calculated from classified 
observations, we must assume an even distribution of cases 
within each class. The class-interval should be selected 
with this in mind, in order that errors introduced by the 
assumption may be minimized. If the items in each class 
are evenly distributed, the mid-value of each class may be 
taken as representative of all the observations included; 
when such a mid-value is multiplied by the number of items 
in the class, the product is approximately equal to the sum 
of all the individual items in the class. The formula for the 
mean thus becomes 

\[ M = \frac{\Sigma(fm)}{N}. \] 

Table 22 illustrates the procedure in detail. 

Table 22* 

Calculation of the Arithmetic Mean of Weekly Earnings of Workers 
in Open-Hearth Furnaces in the Pittsburgh District in 1935 

Class-interval           Mid-point    Frequency        fm 
(dollars per week)           m            f 

$  0-$  3.99                  2           67            134 
   4-   7.99                  6          290          1,740 
   8-  11.99                 10          437          4,370 
  12-  15.99                 14          730         10,220 
  16-  19.99                 18        1,056         19,008 
  20-  23.99                 22        1,009         22,198 
  24-  27.99                 26          712         18,512 
  28-  31.99                 30          609         18,270 
  32-  35.99                 34          334         11,356 
  36-  39.99                 38          187          7,106 
  40-  43.99                 42          179          7,518 
  44-  47.99                 46          105          4,830 
  48-  51.99                 50           60          3,000 
  52-  55.99                 54           67          3,618 
  56-  59.99                 58           28          1,624 
  60-  63.99                 62           37          2,294 
  64-  67.99                 66           33          2,178 
  68-  71.99                 70           29          2,030 
  72-  75.99                 74           16          1,184 
  76-  79.99                 78            8            624 
  80-  83.99                 82            3            246 
  84-  87.99                 86            8            688 
  88-  91.99                 90            4            360 
  92-  95.99                 94            7            658 
  96-  99.99                 98            9            882 
 100- 103.99                102            5            510 
 104- 107.99                106            1            106 
 108- 111.99                110            1            110 

Total                                  6,031        145,374 

\[ M = \frac{\Sigma(fm)}{N} = \frac{\$145{,}374}{6{,}031} = \$24.1045 \] 

* These figures and similar data appearing in subsequent tables were 
compiled by Edward K. Frazier, of the Division of Wages, Hours and Working 
Conditions, U. S. Bureau of Labor Statistics. See “Earnings and Hours in 
Blast Furnaces, Bessemer Converters, Open-Hearth Furnaces and Electric 
Furnaces, 1933 and 1935,” Monthly Labor Review, April, 1936. The detailed 
statistics in Table 22 were provided through the courtesy of Dr. Isador Lubin, 
Commissioner of Labor Statistics. 

The value secured in this way is sometimes called a 
weighted arithmetic mean. What we do, in effect, is to 
secure the arithmetic mean of the 28 figures in the column 
headed m. We do not take a simple average of these figures, 
however, but weight each one in proportion to the 
number of cases falling in the class-interval of which it is 
the mid-value. It is precisely the procedure we should follow 
in calculating the mean of five men’s incomes, two of 
whom, let us say, have incomes of $2,000 and three of whom 
have incomes of $3,000. Clearly it would not do to add the 
figures $2,000 and $3,000, dividing the sum by two. The 
figure $2,000 is given a weight of two, the figure $3,000 is 
given a weight of three, and the resultant sum, $13,000, is 
divided by five. Though the procedure in working from the 
frequency distribution is thus a form of weighting, the term 
“weighted average” is coming to have a more restricted 
meaning, to be explained at a later point, and should not 
in general be applied to an average computed from a 
frequency distribution. 
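
The calculation of Table 22 may be sketched in a few lines of Python. The mid-points and frequencies are those of the table; the variable names are illustrative only. The same weighting is then applied to the example of the five men's incomes.

```python
# Table 22: class mid-points (dollars per week) and class frequencies.
midpoints = list(range(2, 111, 4))          # 2, 6, 10, ..., 110
frequencies = [67, 290, 437, 730, 1056, 1009, 712, 609, 334, 187,
               179, 105, 60, 67, 28, 37, 33, 29, 16, 8,
               3, 8, 4, 7, 9, 5, 1, 1]

# M = sum of (f * m) divided by N.
n = sum(frequencies)
sum_fm = sum(f * m for f, m in zip(frequencies, midpoints))
mean = sum_fm / n
print(f"N = {n}, sum of fm = {sum_fm}, M = {mean:.4f}")

# The five men's incomes: each figure is weighted by the number of men.
incomes = {2000: 2, 3000: 3}                # income -> number of men
mean_income = sum(v * w for v, w in incomes.items()) / sum(incomes.values())
print(f"mean income of the five men: {mean_income}")
```

Adding $2,000 and $3,000 and dividing by two would give $2,500; the weighted calculation gives $2,600, since three of the five men have the larger income.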

SHORT METHOD OF COMPUTING THE ARITHMETIC MEAN 

The calculation of the arithmetic mean from the frequency 
table is much easier, in general, than from the ungrouped 
data, but when the number of cases included is 
large even the computation from the frequency table by 
the method illustrated above may be laborious. The procedure 
may be greatly simplified. 

From the method of computing the arithmetic mean it 
follows that the algebraic sum of the deviations of a series 
of individual magnitudes from their mean is zero. This 
may be readily demonstrated. We may represent the series
of magnitudes by m_1, m_2, m_3, . . . m_n, their arithmetic
mean by M, and the deviations of the various magnitudes
from the mean by d_1, d_2, d_3, . . . d_n.

Then

\[ M = \frac{m_1 + m_2 + m_3 + \dots + m_n}{N} \qquad (1) \]

and

\[ m_1 + m_2 + m_3 + \dots + m_n = NM. \qquad (2) \]

The number of terms, of course, is equal to N. Therefore,
subtracting M N times from each side of the equation,

\[ (m_1 - M) + (m_2 - M) + (m_3 - M) + \dots + (m_n - M) = 0. \qquad (3) \]

But m_1 - M = d_1, m_2 - M = d_2, etc., and equation (3) may be written

\[ \Sigma d = 0. \]

Knowing this to be true we may measure the deviations
of a series of magnitudes from any arbitrary quantity,
secure the algebraic sum of the deviations, and from this
value ascertain the difference between the arbitrary quan- 
tity and the true mean. For this difference will be the 
mean of the deviations from the arbitrary origin. If we 
let M' represent the arbitrary origin, or assumed mean, 
while c = M - M', and d_1', d_2', d_3', . . . d_n' represent the
deviations of the various magnitudes from M' (i.e., d_1'
= m_1 - M', d_2' = m_2 - M', etc.), then

\[ d_1' = d_1 + c, \quad d_2' = d_2 + c, \quad d_3' = d_3 + c, \quad \dots \quad d_n' = d_n + c \]

and

\[ \Sigma d' = \Sigma d + Nc. \]

But

\[ \Sigma d = 0 \]

and

\[ \Sigma d' = Nc, \qquad c = \frac{\Sigma d'}{N}. \]

From the known values of M' and c the value of the true
mean may be obtained, for M = M' + c. The procedure is
illustrated in the following simple example:


Table 23

Computation of the Arithmetic Mean (Short Method)
(Ungrouped data)

    m        f        d'
    5        1       -15
   15        1        -5
   25        1        +5
   35        1       +15
   45        1       +25

  Total      5       +25

M' = 20
c = Σd'/N = +25/5 = 5
M = M' + c = 20 + 5 = 25



When the deviations are measured from 20 as arbitrary 
origin there is in each case a constant error, if the devia- 
tion from the true mean be taken as standard. This error 
is equal to the difference between the true and the assumed 
means. The algebraic sum of the deviations from the 
assumed mean will equal N times this constant error, since 
the error is repeated once for every item included. By
dividing the sum of these deviations by N the amount of 
the error may be determined and the value of the mean 
thus obtained. 
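
The argument above can be checked numerically. The following is a minimal sketch in Python, using the five magnitudes of Table 23; the function name `short_method_mean` is an illustrative choice, not part of the text.

```python
def short_method_mean(values, assumed_mean):
    """Mean via deviations from an arbitrary origin: M = M' + (sum of d') / N."""
    deviations = [v - assumed_mean for v in values]  # d' for each item
    correction = sum(deviations) / len(values)       # c = sum(d') / N
    return assumed_mean + correction


values = [5, 15, 25, 35, 45]           # the magnitudes of Table 23
mean = short_method_mean(values, 20)   # M' = 20, as in the table
print(mean)                            # 25.0

# Deviations measured from the true mean sum to zero.
print(sum(v - mean for v in values))   # 0.0
```

Any other assumed mean gives the same result, since the correction c absorbs the difference between the assumed and the true mean.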


Table 24

Calculation of the Arithmetic Mean of Weekly Earnings of Workers
in Open-Hearth Furnaces in the Pittsburgh District in 1935
(Short method)

Class-interval      Mid-point   Frequency   d' (in class-       fd'
(in dollars             m           f       interval units)
per week)

$ 0-$ 3.99              2           67           - 7           -  469
  4-  7.99              6          290           - 6           -1,740
  8- 11.99             10          437           - 5           -2,185
 12- 15.99             14          730           - 4           -2,920
 16- 19.99             18        1,056           - 3           -3,168
 20- 23.99             22        1,009           - 2           -2,018
 24- 27.99             26          712           - 1           -  712
 28- 31.99             30          609             0                0
 32- 35.99             34          334           + 1           +  334
 36- 39.99             38          187           + 2           +  374
 40- 43.99             42          179           + 3           +  537
 44- 47.99             46          105           + 4           +  420
 48- 51.99             50           60           + 5           +  300
 52- 55.99             54           67           + 6           +  402
 56- 59.99             58           28           + 7           +  196
 60- 63.99             62           37           + 8           +  296
 64- 67.99             66           33           + 9           +  297
 68- 71.99             70           29           +10           +  290
 72- 75.99             74           16           +11           +  176
 76- 79.99             78            8           +12           +   96
 80- 83.99             82            3           +13           +   39
 84- 87.99             86            8           +14           +  112
 88- 91.99             90            4           +15           +   60
 92- 95.99             94            7           +16           +  112
 96- 99.99             98            9           +17           +  153
100-103.99            102            5           +18           +   90
104-107.99            106            1           +19           +   19
108-111.99            110            1           +20           +   20

Total                            6,031                         -8,889

Calculations (M' = $30)

1. Algebraic sum of deviations from M'
       Σfd' = -13,212 + 4,323 = -8,889

2. Calculation of c (in class-interval units)
       c = -8,889/6,031 = -1.47388

3. Reduction of c to original units
       Class-interval = $4
       c (in original units) = -1.47388 × $4 = -$5.8955

4. Determination of M
       M = M' + c
       M = $30 - $5.8955
       M = $24.1045


The work of computation may be still further abbrevi- 
ated, for observations arranged in the form of a frequency 
distribution, by measuring the deviations in terms of the 
class-interval as a unit. Then, in finally applying the neces- 
sary correction, the difference between the true and assumed 
means may be again expressed in terms of the original units. 




The method may be illustrated in detail with reference to 
the wage data for which the mean has already been calcu- 
lated. 

The steps in this process of calculating the arithmetic 
mean by the short method may be briefly summarized : 

1. Organize the data in the form of a frequency distribution.

2. Adopt as the assumed mean the midpoint of a class near the
   center of the distribution.

3. Arrange a column showing the deviation (d') from the assumed
   mean of the items in each class, in terms of class-interval
   units. This deviation will be zero for the items in the class
   containing the assumed mean, -1 for the items in the next
   lower class, +1 for the items in the next higher class, and so
   on.

4. Multiply the deviation of each class by the frequency of that
   class, taking account of signs. These products are entered
   in the column fd'.

5. Get the algebraic sum of the items entered in the column fd'.

6. Divide this sum by the total frequency (N). The quotient is
   the correction (c) in class-interval units.

7. Multiply the correction (c) by an amount equal to the class-interval.
   The product is the correction in terms of the
   original units.

8. Add this correction (algebraically) to the assumed mean (M');
   the sum is the true mean (M).
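
The eight steps may be sketched in Python as follows. This is an illustration only: the function name `grouped_mean_short` is hypothetical, and the frequencies are those of Table 24 as read from the printed table (they total 6,031).

```python
def grouped_mean_short(midpoints, freqs, assumed_mean, interval):
    """Steps 3-8: deviations in class-interval units, then M = M' + c * interval."""
    total = sum(freqs)                                                 # N
    steps = [round((m - assumed_mean) / interval) for m in midpoints]  # d'
    sum_fd = sum(f * d for f, d in zip(freqs, steps))                  # sum of fd'
    c_units = sum_fd / total                # correction in class-interval units
    return assumed_mean + c_units * interval


midpoints = list(range(2, 111, 4))   # 2, 6, 10, ..., 110 (dollars per week)
freqs = [67, 290, 437, 730, 1056, 1009, 712, 609, 334, 187, 179, 105, 60, 67,
         28, 37, 33, 29, 16, 8, 3, 8, 4, 7, 9, 5, 1, 1]

print(round(grouped_mean_short(midpoints, freqs, 30, 4), 4))   # 24.1045
```

The result agrees with the long method, in which the sum of the products fm (145,374) is divided by N (6,031).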

Location of the Median 

UNGROUPED DATA

The median is a value of a variable so selected that 
50 per cent of the total number of cases, when arranged in 
order of magnitude, lie below it and 50 per cent above it. 
For many frequency distributions this is a useful and sig- 
nificant value. 

When handling data which are not arranged in the form 
of a frequency distribution the location of the median is a 
simple matter. The data having been arranged in order of 
magnitude, it is necessary only to count from one end 
until that point on the scale of values is found which divides 
the number of cases into two equal parts. As a simple
example we may assume that the following seven figures 
represent the annual incomes of seven individuals: 

$750 $975 $1,128 $1,450 $1,475 $1,825 $1,950 

The scale of values extends from $750 to $1,950, and
seven items are arranged along this scale. The value of
$1,000 has two items on one side and five items on the
other, so obviously does not conform to our definition of
the median. The value of $1,450, which corresponds with
the income of one of the seven individuals, is the median
in this case. Three items lie on each side of this value; or,
if we assume the central item to be cut in two, 3½ items lie
on each side of this point. This case is illustrated in Fig. 46.
This diagram may help to bring out the fact that the
median is a point on a scale so located that it cuts the
frequencies in two.

Fig. 46. Illustrating the Location of the Median with Ungrouped Data
(Personal incomes of seven individuals; income scale in dollars)

The problem is slightly different when an even number of
cases is included. This condition is exemplified in Table 25,
which shows the average earnings per man-hour
in each of 38 selected industries during the year 1933.

Table 25

Average Earnings per Man-Hour in Selected Manufacturing
Industries *

                                                     Average wage
                                                     per man-hour
Silk and rayon goods: Commission throwing                $.278
Cotton goods
Cigars
Silk and rayon goods: Commission weaving                  .313
Silk and rayon goods: Regular throwing                    .316
Knit underwear                                            .319
Knit outerwear                                            .358
Cigarettes                                                .361
Silk and rayon goods: Regular weaving                     .369
Wool shoddy                                               .370
Hosiery                                                   .372
Cotton small wares                                        .378
Woolen goods                                              .395
Sugar, beet                                               .395
Worsted goods                                             .399
Snuff, and chewing and smoking tobacco                    .402
Knit cloth                                                .414
Rayon yarns                                               .421
Feeds, prepared                                           .425
Meat packing                                              .426
Pulp                                                      .431
Ice, manufactured                                         .436
Flour milling                                             .444
Paper                                                     .445
Carpets and rugs, wool                                    .464
Leather tanning                                           .470
Sugar refining, cane                                      .481
Soap                                                      .482
Blast furnaces                                            .488
Felt goods                                                .488
Cereal preparations                                       .510
Steel works                                               .519
Motor vehicle bodies and parts                            .561
Machine tools                                             .585
Motor vehicles                                            .610
Machine-tool accessories                                  .621
Petroleum refining                                        .643
Malt                                                      .657

* From Monthly Labor Review, October, 1935, 910.

In this case the median must be a value on each side of
which 19 industries lie. Therefore any value exceeding
$0.425 (average earnings in the prepared feed industry)
and less than $0.426 (average earnings in the meat packing
industry) will satisfy the definition of a median. Under
these conditions, where the median is really indeterminate,
a value half-way between the two limiting values is accepted,
by convention. The median of the 38 figures would thus
be $0.4255.

In this example the median value does not correspond 
with the earnings in any one industry. This will frequently 
be so when there is an even number of observations. 
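
The two cases, odd and even, may be sketched in Python as below. The function name `median_ungrouped` is illustrative. For the even case only the four central values of Table 25 are used, which is an assumption of convenience: dropping equal numbers of items from both ends leaves the median unchanged.

```python
def median_ungrouped(values):
    """Median of ungrouped data; half-way between the two central items when N is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # the single central item
    return (ordered[mid - 1] + ordered[mid]) / 2   # conventional half-way value


incomes = [750, 975, 1128, 1450, 1475, 1825, 1950]
print(median_ungrouped(incomes))                   # 1450

# The 18th to 21st of the 38 industries in Table 25.
central_four = [0.421, 0.425, 0.426, 0.431]
print(round(median_ungrouped(central_four), 4))    # 0.4255
```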


GROUPED DATA

The task of locating the median is essentially the same 
when the data are in the form of a frequency distribution. 
The fact that the real values of the individual items are 
not known, because of the grouping by classes, complicates 
the problem slightly. The data in Table 26, relating to 
advertising rates of daily newspapers in the United States, 
may be used in illustrating the method. 

Table 26

Location of Median, Newspaper Advertising Rates in 1933
Minimum Line Rates for National Advertising, 245 Daily Newspapers
in Cities of 25,000 to 50,000 Population *

Class-interval          No. of newspapers
Rate per line           charging stated rate
(in cents)                       f

  1.0- 2.99                      6
  3.0- 4.99                     53
  5.0- 6.99                     85
  7.0- 8.99                     56
  9.0-10.99                     21
 11.0-12.99                     16
 13.0-14.99                      4
 15.0-16.99                      4

 Total                         245

Calculations

N/2 = 245/2 = 122.5

Md = 5.0 + (63.5/85 × 2.0)
   = 5.0 + 1.49
   = 6.49

* Source: Editor and Publisher, International Yearbook for 1933.

In the present case the location of the median involves
the determination of that value on each side of which 122.5
items lie. We may assume that we start at the lower end
of the scale and move through the successive classes. When 
we reach the upper limit of the first class (that including 
items having values from 1.0 to 3.0) we have left behind 
us 6 cases, while 239 lie in front of us. When the upper 
limit of the second class is attained, 59 items have been 
passed. The upper limit of the third class has below it 
144 items. Somewhere between the lower and upper limits 
of the third class lies the desired point, that which has 
122.5 items on each side of it. How far must we move 
through this class, from 5.0 to 7.0 in order to reach this 
point? 

It will be recalled that, for purposes of calculation, the
assumption is made that there is a uniform distribution of
the items lying within any given class. Since before we
reach the third class 59 cases have been counted, only 63.5
of the 85 included in this class are needed to complete the
desired number, 122.5. On the assumption of even distribution
the required 63.5 cases will lie within a distance
on the scale equal to 63.5/85 of the class-interval. The class-interval
is 2.0; 63.5/85 of 2.0 is equal to 1.49. As we move
up the scale, then, having reached 5.0, we proceed an additional
distance equal to 1.49. At a point on the scale having
a value of 6.49 is the dividing line on each side of which
lie 122.5 cases. This is the value of the median.

The process of computation is shown at the right of the 
frequency table. The following is a summary of the steps 
involved in the location of the median: 

1. Arrange the data in the form of a frequency distribution.

2. Divide the total number of measures by 2; this gives the
   number which must lie on each side of the point to be located.

3. Begin at the lower end of the scale and add together the
   frequencies in the successive classes until the lower limit of the
   class containing the median value is reached.

4. Determine the number of measures from this class which must
   be added to the frequencies already totaled to give a number
   equal to N/2.

5. Divide the additional number thus required by the total
   number of cases in the class containing the median. This
   indicates the fractional part of the class-interval within which
   the required cases lie.

6. Multiply the class-interval by the fraction thus set up.

7. To the lower limit of the interval containing the median add
   the result of the multiplication process indicated in (6).
   This gives the value of the median.

The last three steps constitute merely a simple form of 
interpolation. 

The entire process may be reversed by beginning at the 
upper end of the scale and counting downwards. In this 
case the final operation is one of subtraction from the upper 
limit of the interval containing the median. 

N/2 may be a fractional value, as in the example given, 
or a whole number. The operation is precisely the same in 
the two cases. 
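
The interpolation may be sketched in Python as follows, counting upward from the lower end of the scale. The function name `grouped_median` is illustrative; the data are those of Table 26, and the even distribution of items within each class is assumed, as in the text.

```python
def grouped_median(lower_limits, freqs, interval):
    """Median of a frequency distribution by interpolation within the median class."""
    total = sum(freqs)
    if total == 0:
        raise ValueError("the distribution contains no cases")
    half = total / 2                     # N/2, which may be fractional
    below = 0                            # frequencies already counted
    for lower, f in zip(lower_limits, freqs):
        if below + f >= half:            # the median falls in this class
            needed = half - below        # cases still required from this class
            return lower + needed / f * interval
        below += f
    raise ValueError("frequencies do not reach N/2")


lower_limits = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0]   # cents per line
freqs = [6, 53, 85, 56, 21, 16, 4, 4]

print(round(grouped_median(lower_limits, freqs, 2.0), 2))    # 6.49
```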

Quartiles and Deciles

For many purposes it is desirable to locate on the scale 
of values, along which the items constituting a frequency 
distribution are ranged, points dividing the total number 
of measures in other ways. Similar to the median, which 
divides the total number of cases into two equal groups, 
are the quartiles, deciles, and percentiles. The quartiles, 
as the term implies, are points on the scale which divide 
the entire number of measures into four equal groups, the 
deciles divide the number into ten equal groups, and the 
percentiles divide the total number of cases into 100 equal 
groups. Thus the first quartile is that point on the scale 
below which one quarter of the total number of cases lie 
and above which three quarters of the total number of 
cases lie. The second quartile and the median are identical 
values. The third decile is that point on the scale below 
which three tenths of the total number of cases lie and
above which seven tenths of the total number of cases lie. 
In all cases the count begins at the lower end of the scale. 

Example: Location of the First Quartile (Q_1), Newspaper Advertising Rates
(See Table 26)

N/4 = 61.25
Q_1 = 5.0 + (2.25/85 × 2.0)
    = 5.05

Example: Location of Eighth Decile (D_8), Newspaper Advertising Rates
(See Table 26)

N/10 = 24.5
8N/10 = 196
D_8 = 7.0 + (52/56 × 2.0)
    = 8.86

A method of locating median, quartiles, deciles and per- 
centiles graphically is explained below. 
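
The same interpolation serves for any quartile, decile, or percentile once the required fraction of N is stated. The sketch below is a Python illustration under that reading; `grouped_quantile` and its `fraction` parameter are names chosen here, not terms from the text.

```python
def grouped_quantile(lower_limits, freqs, interval, fraction):
    """Point on the scale below which `fraction` of the cases lie (count from the lower end)."""
    target = fraction * sum(freqs)       # e.g. N/4 for Q1, 8N/10 for D8
    below = 0
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and below + f >= target:
            return lower + (target - below) / f * interval
        below += f
    raise ValueError("fraction must lie between 0 and 1")


lower_limits = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0]   # Table 26
freqs = [6, 53, 85, 56, 21, 16, 4, 4]

print(round(grouped_quantile(lower_limits, freqs, 2.0, 0.25), 2))   # 5.05
print(round(grouped_quantile(lower_limits, freqs, 2.0, 0.80), 2))   # 8.86
print(round(grouped_quantile(lower_limits, freqs, 2.0, 0.50), 2))   # 6.49
```

The last line reproduces the median, which is identical with the second quartile.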

Location of the Mode 

The mode is the value of the x-variable corresponding to 
the maximum ordinate of a given frequency curve. The 
concept of a modal value is a thoroughly easy one to grasp. 
It is the most common wage, the most common income, 
the most common height. It is the point where the con- 
centration is greatest, a characteristic which is effectively 
brought out by Fechner’s term for this average, dichtester 
wert, or thickest value. It is not so easy, however, to locate 
the true modal value in a given case. In general statistical 
work an approximate value only is secured for the mode, 
but for most practical purposes this value is usually
sufficiently accurate.*

The method of determining this approximate modal value 
may be illustrated by reference to the distribution shown 
in Table 27 on page 116. 

Table 27

Frequency Distribution of Five Per Cent Bonds

(This table is based upon quotations on the New York Stock Exchange on
June 13, 1936, on railroad and industrial bonds with coupon rate of 5 per cent)

Quoted price        Mid-point     Frequency
Class-interval          m             f

Less than 40                         22
 40- 49.9              45             5
 50- 59.9              55             5
 60- 69.9              65             3
 70- 79.9              75             8
 80- 89.9              85             9
 90- 99.9              95            19
100-109.9             105            49
110-119.9             115            10
120-129.9             125             3
130-139.9             135             1

Total                               134

There is wide dispersion of the 22 cases falling below 40,
and the existence of this "open-end" class makes it impossible
to compute the mean, as the table stands. The mode
is therefore an appropriate average to employ in the present
instance.

* A method of locating the mode more accurately is explained in a later
section.

The class having limits of 100-109.9 contains the greatest
number of cases. This appears to be the modal group, and
the mid-point of this class, 105, may be tentatively accepted
as the value of the approximate mode. But with different
classifications quite different values might be secured for 
the mode. When the original bond quotations are tabulated 
with varying class-intervals the following results are secured. 
(Only the frequencies of the central classes are shown. It 
is not necessary, for this purpose, to present each of the 
tables as a whole.) 


(a) Class-interval = 5

Class-interval         f
 80- 84.9              3
 85- 89.9              6
 90- 94.9             10
 95- 99.9              9
100-104.9             29
105-109.9             20
110-114.9              7
115-119.9              3

(b) Class-interval = 2.5

Class-interval         f
 90.0- 92.49           4
 92.5- 94.99           6
 95.0- 97.49           2
 97.5- 99.99           7
100.0-102.49           9
102.5-104.99          20
105.0-107.49          13
107.5-109.99           7

(c) Class-interval = 2.5

Class-interval         f
 98.75-101.249         6
101.25-103.749        17
103.75-106.249        20
106.25-108.749         8

(d) Class-interval = 1

Class-interval         f
100-100.9              1
101-101.9              2
102-102.9              9
103-103.9             10
104-104.9              7
105-105.9              6
106-106.9              5
107-107.9              4

With a class-interval of 5 a value of 102.5 is secured for
the mode; with a class-interval of 2.5 a value of 103.75 is
obtained. A class-interval of 2.5, again, but with different
class limits, yields a mode of 105. Finally, a class-interval
of 1 gives a mode of 103.5. Further changes in classification
would give still other values. The mode thus appears to be
a curiously intangible and shifting average. Its value, for
the same data, seems to vary with changes in the size of the
class-interval and in the location of the class-limits.
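
The shifting of the approximate mode can be reproduced with a short Python sketch. Each class is written here as (lower limit, upper limit, frequency), with the upper limit taken as the lower limit of the next class; that representation and the function name `crude_mode` are assumptions made for illustration. The frequencies are those of the four tabulations of the central classes, as read from the printed tables.

```python
def crude_mode(classes):
    """Mid-value of the class of greatest frequency."""
    lower, upper, _ = max(classes, key=lambda c: c[2])
    return (lower + upper) / 2


tab_a = [(80, 85, 3), (85, 90, 6), (90, 95, 10), (95, 100, 9),
         (100, 105, 29), (105, 110, 20), (110, 115, 7), (115, 120, 3)]
tab_b = [(90, 92.5, 4), (92.5, 95, 6), (95, 97.5, 2), (97.5, 100, 7),
         (100, 102.5, 9), (102.5, 105, 20), (105, 107.5, 13), (107.5, 110, 7)]
tab_c = [(98.75, 101.25, 6), (101.25, 103.75, 17),
         (103.75, 106.25, 20), (106.25, 108.75, 8)]
tab_d = [(100, 101, 1), (101, 102, 2), (102, 103, 9), (103, 104, 10),
         (104, 105, 7), (105, 106, 6), (106, 107, 5), (107, 108, 4)]

print([crude_mode(t) for t in (tab_a, tab_b, tab_c, tab_d)])
# [102.5, 103.75, 105.0, 103.5]
```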

These difficulties arise primarily from limitations to the 
size of the sample being studied. The true mode, that 
value which would occur the greatest number of times in 
an infinitely large sample, could be located exactly if we 
could increase indefinitely the number of cases included. 
For, given sufficient cases, the approximate mode approaches 
the true mode as the class-interval decreases. Grouping in 
large classes obscures details, and as these classes are re- 
duced in size more of the details are seen and a truer picture 
of the actual distribution is secured. But since most prac- 
tical work is necessarily based upon relatively small samples, 
the increase in the number of classes reveals gaps and 
irregularities, and causes such a loss of symmetry and order 
that doubt arises as to where the point of greatest concen- 
tration really lies. The different tabulations of bond prices 
furnish an excellent example of this. 

By mathematical methods it is possible to obtain a value
for the true mode without securing an infinite number of
cases. The smoothing process has been briefly explained.
One sort of smoothing involves the fitting of an appropriate
type of ideal frequency curve to the data of a given
frequency distribution. This gives, theoretically, the distribution
which would be secured by the process first indicated,
that of decreasing indefinitely the size of the class-interval
and increasing indefinitely the number of cases.
The value of the x-variable corresponding to the maximum
ordinate of this ideal fitted curve is the true mode.

For most practical purposes approximate values of the mode
are adequate, and these may be secured by much simpler 
methods. A first and rough approximation may be obtained 
by taking the mid-value of the class of greatest frequency, a 
method suggested above. If the general rules for classifica- 
tion which were outlined in an earlier section have been fol- 
lowed, this procedure will not generally involve a gross error. 

It is possible, given a fairly regular distribution, to secure, 
by a process of interpolation within the modal group, a 
closer approximation than is obtained by accepting the mid- 
value of this group as the mode. Referring again to the 
tabulation of bond prices in Table 27 it will be noted that 
the distribution on the two sides of the modal class is not 
symmetrical. The modal class is that with a mid-value of 
105. The class next below, with a mid-value of 95, contains
19 cases, while that next above, with a mid-value of 115,
contains but 10 cases. The disproportion is continued in 
the succeeding classes below and above, more cases being 
bulked below the modal class than above. For other pur- 
poses we have assumed an even distribution of cases between 
the upper and lower limits of each class, but it is probable 
that this is not true of the modal class in the present case. 
Judging from the distribution outside this class, it is likely 
that the concentration is greater in the lower half of the 
class-interval, that is, between 100 and 105. The mode, 
therefore, probably lies below the mid-value 105, rather 
than precisely at that point. We may attempt to locate it
within the group by weighting, assuming a pull toward the 
lower end of the scale equal to 19 (the number in the class 
next below) and a pull toward the upper end of the scale 
equal to 10 (the number in the class next above). This may 
be expressed by a formula, employing the following symbols: 

l = lower limit of modal class.

f_1 = frequency of class next below modal class in value.

f_2 = frequency of class next above modal class in value.

i = class-interval.

The interpolation formula is

\[ Mo = l + \frac{f_2}{f_2 + f_1} \times i \]

Applying this formula to the bond price data presented in
Table 27, we have

\[ Mo = 100 + \left( \frac{10}{29} \times 10 \right) = 100 + 3.45 = 103.45. \]


A closer approximation may sometimes be secured by basing
the weights (represented by f_2 and f_1) upon the total
frequencies of the two or three classes next above the
modal class and the same number below. If three classes
on each side are included in the present case, a value of
102.8 is secured for the mode of bond prices.
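
Both results may be verified with a brief Python sketch of the interpolation formula. The function and argument names are illustrative; the frequencies are taken from Table 27.

```python
def interpolated_mode(lower_limit, f_below, f_above, interval):
    """Mo = l + f2 / (f2 + f1) * i, with f1 the weight below and f2 the weight above."""
    return lower_limit + f_above / (f_above + f_below) * interval


# One class on each side of the modal class 100-109.9.
print(round(interpolated_mode(100, 19, 10, 10), 2))                   # 103.45

# Three classes on each side: 19 + 9 + 8 below, 10 + 3 + 1 above.
print(round(interpolated_mode(100, 19 + 9 + 8, 10 + 3 + 1, 10), 2))   # 102.8
```

The heavier the frequencies below the modal class relative to those above, the nearer the interpolated mode lies to the lower limit.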

In some cases the problem of locating the mode is com- 
plicated by the existence of several points of concentration, 
rather than the single point which has been assumed in 
the preceding explanation. Thus in Table 9, representing 
the distribution of wages, with a class-interval of 25 cents, 
there are two definite modal points. A distribution of this 
type is called bi-modal; when plotted, a frequency curve
having two humps is obtained. If the data are homogeneous
such a distribution is the result of paucity of data and
of the method of classification employed. It may be due 
to the use of a class-interval too small, with respect to the 
number of cases included in the sample. An approximate 
mode may be determined in such cases by shifting the 
class-limits and increasing the class-interval, carrying on 
this process until one modal group is definitely established. 
This reverses the process by which the true mode may be 
located when the number of cases is infinitely large. Under 
such conditions the class-interval might be reduced until 
it was infinitely small. But with a limited number of cases 
the location of the point where the concentration is greatest 
necessitates increasing the size of the class-interval, in order
to get away from the irregularities due to the smallness of
the sample. 

If the distribution remains bi-modal in spite of changes 
in the class-intervals and class-limits, it is probable that 
the data are not homogeneous, that two different distri- 
butions have by mistake been combined. Such cases are 
not uncommon in biometrical work. The existence of two 
distinct animal species where only one was suspected has 
been revealed in this way. The whole significance of a 
frequency distribution will be lost if the data are not homogeneous,
a fact which is as true of work in the field of economic
statistics as in any other.

DETERMINATION OF THE MODAL VALUE FROM MEAN
AND MEDIAN

Another method of securing an approximate value for 
the mode, a method based upon the relationship between 
the values of the mean, median and mode, may be em- 
ployed in certain cases. In a perfectly symmetrical distri- 
bution mean, median and mode coincide. As the distribu- 
tion departs from symmetry these three points on the scale 
are pulled apart. If the degree of asymmetry is only moderate
the three points have a fairly constant relation. The
mode and mean lie farthest apart, with the median one 
third of the distance from the mean towards the mode. If 
the asymmetry is marked, no such relationship may pre- 
vail. Having the values of any two of the averages in a 
moderately asymmetrical frequency distribution, therefore, 
the other may be approximated. In fact, however, the 
method should only be employed in determining the value 
of the mode, as the other two values may be computed 
more accurately by other methods. The value of the mode 
itself should only be determined in this way when more 
exact methods are not applicable or are not called for. 

The following formula is based upon this relationship: 
Mo = Mean — 3(Mean — Md). 
Applying this formula to the telephone pole data shown
in Table 12, the following result is secured: 

Mo = 9.33 - 3(9.33 - 9.015) = 8.385. 

This value is slightly below the mid-value of the modal
class, 8.5, and is also less than the value 8.49 which is secured
by weighting within the modal group (using four classes
on each side).
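
The relation may be written as a one-line Python function, given here as a sketch for moderately asymmetrical distributions only; the function name is illustrative, and the mean and median are those quoted above for the telephone pole data.

```python
def mode_from_mean_median(mean, median):
    """Approximate mode from Mo = Mean - 3(Mean - Md)."""
    return mean - 3 * (mean - median)


print(round(mode_from_mean_median(9.33, 9.015), 3))   # 8.385
```

When mean and median coincide the formula returns the same value for the mode, as it should for a symmetrical distribution.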

It must be emphasized that there is a fictitious accuracy
to all these values for the mode. All the methods of locat- 
ing the mode which have been discussed are merely approx- 
imative, a fact not to be forgotten in interpreting and uti- 
lizing the results. 

Graphic Location of Mode, Median, Quartiles, and Deciles

A better understanding of the frequency curve and of 
the cumulative frequency curve may be secured through a 
brief discussion of certain methods of locating graphically 
some of the statistical measures that have been described. 

The value of the mode may be readily determined from 
a frequency curve of the usual type, for, by definition, the 
mode is the reading on the horizontal scale corresponding 
to the maximum ordinate of such a curve. If this reading 
be taken from the frequency polygon a rough value will be 
obtained, the mid-value of the class of greatest frequency. 
A closer approximation to the true value of the mode will 
be secured from a curve which has been smoothed, either 
by inspection or by mathematical methods. Figure 47, 
showing a curve (smoothed by inspection) based upon the 
wage data presented in Table 8, indicates how the mode 
may be located graphically. The horizontal reading corre- 
sponding to the maximum ordinate of this curve is $27.50, 
an approximate value of the mode which may be compared 
with the values of $27.69 secured by the weighting process
and of $27.3470 secured from the values of the mean and
median. 

The locations of the median and mean have been indicated
on this chart. It has been pointed out that in moderately
asymmetrical (or skewed) distributions there tends
to be a constant relationship between the three averages
which have been described, the median lying between the
mean and the mode, and approximately one third of the
distance from the former towards the latter. In the present
case this relationship holds fairly well when the value of
the mode is approximated from the smoothed curve. The
irregularities in the original data render the process of
smoothing by inspection rather arbitrary, however.

Fig. 47. Distribution of Weekly Earnings of Employees. A Smoothed
Frequency Curve, showing the Relation between Mean, Median and Mode

In Fig. 48 the same data are represented by a cumulative
frequency curve, based upon Table 28. The steepness
of a cumulative frequency curve within any given interval
depends upon the number of cases added within the corresponding
interval on the horizontal scale. Thus the curve
rises gradually at first, then more steeply, and tails off 
gradually at the upper extremity. The value of the mode, 
obviously, is the reading on the horizontal scale corresponding
to the point of greatest steepness. This is the point at
which the increase of frequencies is greatest, the point of
greatest concentration in the frequency distribution. The
value of the mode may be approximated from a smoothed
frequency curve by locating the point at which the slope is
greatest (which is a point of inflection) and taking the corresponding
reading on the x-scale. In the present case a value of
approximately $27.50 is secured for the mode by this method.

Fig. 48. Illustrating the Graphic Location of Median and Quartiles

Table 28

Cumulative Distribution of Wage-Earners in a Manufacturing
Establishment

(Classified on the basis of weekly earnings)

                          Number earning stated amount
Weekly earnings                   (frequency)

Less than $22.50                       0
  "    "   23.00                       1
  "    "   23.50                       5
  "    "   24.00                       8
  "    "   24.50                      19
  "    "   25.00                      29
  "    "   25.50                      41
  "    "   26.00                      56
  "    "   26.50                      78
  "    "   27.00                      98
  "    "   27.50                     122
  "    "   28.00                     152
  "    "   28.50                     169
  "    "   29.00                     186
  "    "   29.50                     193
  "    "   30.00                     199
  "    "   30.50                     204
  "    "   31.00                     208
  "    "   31.50                     209
  "    "   32.00                     209
  "    "   32.50                     210

Values for the median, quartiles, and deciles may also be
secured graphically from the cumulative frequency curve.
The smoothing of such a curve provides a quite satisfactory
method of interpolation and, if the scale of the diagram
is sufficiently large, accurate values may be obtained by
this method. Locate on the vertical scale (the scale of
cumulative frequencies) a point distant from the base by N/2.
If from this point a horizontal line be extended to the cumulative
curve, the abscissa of the point of intersection will be
the value of the median. This value may be easily determined
by dropping a vertical line from the point of intersection
to the x-axis. Figure 48 illustrates the application
of this method. A value of $27.125 is secured for the median
by this method. By direct interpolation a value of $27.1458
is obtained. The quartiles may be located in precisely the
same way, the vertical scale being divided into quarters
and horizontal lines extended to the cumulative curve from
the points thus located on the vertical scale.
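
The direct interpolation mentioned above amounts to reading the cumulative curve as a series of straight segments. The Python sketch below illustrates this with the figures of Table 28; the function name and the straight-line assumption between tabulated points are choices made here, and the first cumulative frequency is assumed to be zero.

```python
def median_from_cumulative(upper_bounds, cumulative):
    """Abscissa at which a piecewise-linear cumulative curve reaches N/2."""
    half = cumulative[-1] / 2
    for i in range(1, len(cumulative)):
        if cumulative[i] >= half and cumulative[i] > cumulative[i - 1]:
            x0, x1 = upper_bounds[i - 1], upper_bounds[i]
            y0, y1 = cumulative[i - 1], cumulative[i]
            return x0 + (half - y0) / (y1 - y0) * (x1 - x0)
    raise ValueError("cumulative frequencies never reach N/2")


upper_bounds = [22.5 + 0.5 * k for k in range(21)]   # $22.50, $23.00, ..., $32.50
cumulative = [0, 1, 5, 8, 19, 29, 41, 56, 78, 98, 122, 152, 169, 186,
              193, 199, 204, 208, 209, 209, 210]

print(round(median_from_cumulative(upper_bounds, cumulative), 4))   # 27.1458
```

Replacing N/2 by N/4 or 3N/4 in the same procedure would give the quartiles.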
For some purposes, particularly those that involve the
averaging of rates or ratios rather than quantities, none of 
the averages which have been described is suitable. The 
geometric and the harmonic means are types of averages 
that should be familiar because they are particularly ap- 
propriate for such purposes. 

The Geometric Mean 

The geometric mean is the nth root of the product of n 
measures; its value thus is represented by: 

$$M_g = \sqrt[n]{a_1 \cdot a_2 \cdot a_3 \cdots a_n}.$$

The geometric mean of the numbers 2, 4, 8, is

$$M_g = \sqrt[3]{2 \times 4 \times 8} = \sqrt[3]{64} = 4.$$

It is obvious from the method of computation that if 
any one of the measures in the series has a value of zero the 
geometric mean is zero. 

The actual computation of the geometric mean is greatly 
facilitated by the use of logarithms. In this form 

$$\log M_g = \frac{\log a_1 + \log a_2 + \log a_3 + \cdots + \log a_n}{n}.$$

The logarithm of the geometric mean is equal to the arith- 
metic mean of the logarithms of the individual measures. 
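The logarithmic form lends itself to a short Python sketch. The function below is an illustrative helper (its name is an assumption), and it takes for granted that no measure is negative.

```python
import math

def geometric_mean(values):
    """Antilog of the arithmetic mean of the logarithms of the measures."""
    if any(v == 0 for v in values):
        return 0.0          # one zero measure makes the product, and the mean, zero
    mean_log = sum(math.log10(v) for v in values) / len(values)
    return 10 ** mean_log

gm = geometric_mean([2, 4, 8])
print(round(gm, 6))         # the cube root of 64
```

Working through logarithms also avoids the very large products that arise when many measures are multiplied together directly.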

When the measures, of which the geometric mean is desired,
are to be weighted, the separate weights are introduced
as exponents of the terms to which they apply. Thus
if we represent the sum of the weights by N and the weights
corresponding to the terms $a_1, a_2, \ldots, a_n$, respectively,
by $w_1, w_2, w_3, \ldots, w_n$, the formula for the geometric mean is

$$M_g = \sqrt[N]{a_1^{w_1} \cdot a_2^{w_2} \cdot a_3^{w_3} \cdots a_n^{w_n}}.$$

This is equivalent to repeating each term a number of
times, the number corresponding to the amount by which
it is weighted. (This, of course, is precisely what is done
in securing a weighted arithmetic mean.) When logarithms
are employed the formula for the weighted geometric mean
becomes

$$\log M_g = \frac{w_1 \log a_1 + w_2 \log a_2 + w_3 \log a_3 + \cdots + w_n \log a_n}{N}.$$

A method of computing the geometric mean may be 
illustrated with reference to Table 29, which shows the 
distribution of the prices of 66 preferred stocks paying 
seven per cent dividends. The table is based upon closing 
prices on the New York Stock Exchange and the New York 
Curb Exchange for the week ended July 25, 1936. 

Table 29

Computation of the Geometric Mean of Preferred Stock Prices

Class-interval        m       f      log m       f log m

$ 70-$ 89.9           80      5     1.90309      9.51545
  90- 109.9          100     20     2.00000     40.00000
 110- 129.9          120     27     2.07918     56.13786
 130- 149.9          140      6     2.14613     12.87678
 150- 169.9          160      8     2.20412     17.63296

                             66                136.16305

$$\log M_g = \frac{136.16305}{66} = 2.06308$$

$$M_g = 115.63$$
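The arithmetic of Table 29 may be checked with a brief Python sketch, in which the class frequencies serve as the weights and the mid-points stand for the items in each class. The variable names are illustrative only.

```python
import math

# Table 29: class mid-points (dollars) and frequencies.
midpoints = [80, 100, 120, 140, 160]
freqs = [5, 20, 27, 6, 8]

n = sum(freqs)                                        # sum of the weights, N
sum_f_log_m = sum(f * math.log10(m) for m, f in zip(midpoints, freqs))
log_mg = sum_f_log_m / n                              # weighted mean of the logarithms
mg = 10 ** log_mg                                     # antilog gives the geometric mean
print(n, round(sum_f_log_m, 5), round(log_mg, 5), round(mg, 2))
```

Carried to five decimals the results agree with the totals shown in the table.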


CHARACTERISTICS OF THE GEOMETRIC MEAN 

The nature of the geometric mean may be understood 
by considering its relation to the terms it represents, as an 
average. 

If the arithmetic mean of a series of measures replace 
each item in the series, the sum of the measures will remain 
unchanged. Thus, the sum of the numbers 2, 4, 8 is 14. 
The arithmetic mean of these three numbers is 4 2/3; if this
value be inserted in the place of each of the three measures
the sum remains 14. It is characteristic of the geometric
mean that the product of a series of measures will remain
unchanged if the geometric mean of those measures replace
each item in the series. Thus the product of 2, 4, 8 is 64.
The geometric mean of the three numbers is 4; if this value
replace each of the three measures the product remains 64.

Again, it is true of the arithmetic mean that the sum of 
the deviations of the items above the mean equals the 
sum of the deviations of the items below the mean (disre- 
garding signs). The sums of the differences between the 
individual items and the mean are equal. In the case of 
the geometric mean the products of the corresponding ratios 
are equal. If the ratios of the geometric mean to the meas- 
ures which it exceeds be multiplied together, the product 
will equal that secured by multiplying together the ratios 
to the geometric mean of the measures exceeding it in value. 
For example, the geometric mean of the numbers 3, 6, 8, 9 
is 6. The following equation may be set up: 

$$\frac{6}{3} \times \frac{6}{6} = \frac{8}{6} \times \frac{9}{6}.$$

The last example brings out the most important charac- 
teristic of the geometric mean. It is a means of averaging 
ratios. Its chief use in the field of economic statistics has 
been in connection with index numbers of prices, where 
rates of change are of major importance. A rise in prices 
represented by the change from 50 to 100 is as important 
as a rise from 100 to 200. Yet this equivalence is not brought 
out by the arithmetic mean, which gives double weight 
to the change which involves an absolute difference of 100. 
An example frequently cited is that of two cases of price 
change, one a ten-fold increase, from 100 to 1,000, the other 
a fall to one tenth of the old price, from 100 to 10. The 
arithmetic mean of 1,000 and 10 is 505, the geometric 
mean is $\sqrt{1{,}000 \times 10}$, or 100. When the average is of the
latter type it is seen that the two equal ratios of change
have balanced each other. The arithmetic mean, 505, is
quite incorrect as a measure of average ratio of price change.
This subject is discussed at greater length in the chapter on 
index numbers. 
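Both properties just described, the balancing of ratios about the geometric mean and the offsetting of equal ratios of price change, may be verified numerically. The following Python sketch is illustrative only; it uses the figures of the text and requires Python 3.8 or later for `math.prod`.

```python
import math

# Ratios on either side of the geometric mean balance as products.
values = [3, 6, 8, 9]
gm = math.prod(values) ** (1 / len(values))
below = math.prod(gm / v for v in values if v <= gm)   # mean over smaller items
above = math.prod(v / gm for v in values if v > gm)    # larger items over mean

# A ten-fold rise and a fall to one tenth, as price relatives on a base of 100.
prices = [1000, 10]
arithmetic = sum(prices) / len(prices)
geometric = math.sqrt(prices[0] * prices[1])

print(round(gm, 6), round(below, 6), round(above, 6))
print(arithmetic, geometric)
```

The two products of ratios are equal, and the geometric mean of the two price relatives returns to 100 where the arithmetic mean stands at 505.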

What has been said in an earlier section in regard to the 
advantages of logarithmic charting for certain purposes 
bears upon the use of the geometric mean. This average 
is sometimes called the logarithmic mean, as its logarithm 
is simply the arithmetic mean of the logarithms of the
constituent measures. Wherever percentages of change are
being averaged, where ratios rather than absolute differ- 
ences are significant, the use of the geometric mean is 
advisable. 

A problem involving the use of the geometric mean arises
in computing the average rate of increase of any sum at
compound interest. If $p_0$ represent the principal at the
beginning of the period, $p_n$ the principal at the end of the
period, r the rate of interest and n the number of years
in the period, the sum to which $p_0$ will amount at the end
of the n years, if interest is compounded annually, is
represented by the equation:

$$p_n = p_0 (1 + r)^n.$$

It follows from this that

$$r = \sqrt[n]{\frac{p_n}{p_0}} - 1.$$

Thus, if $1,000 at compound interest amounts to $1,600
at the end of 12 years, there has been an increase of 60 per
cent. The arithmetic mean (60 per cent divided by 12 years)
is 5 per cent, but this is not the
rate at which the money increased. The true rate is:

$$r = \sqrt[12]{\frac{1{,}600}{1{,}000}} - 1 = \sqrt[12]{1.60} - 1 = 1.04 - 1 = .04, \text{ or } 4\%.$$

Precisely the same problem arises whenever rates of increase
or decrease are to be averaged. The use of the
arithmetic mean gives an incorrect result.
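The contrast between the two rates may be illustrated in Python. The function `average_rate` is a hypothetical helper that simply solves the compound interest relation for r.

```python
def average_rate(p0, pn, years):
    """Annual rate r for which p0 * (1 + r) ** years equals pn."""
    return (pn / p0) ** (1 / years) - 1

r = average_rate(1000, 1600, 12)
simple = (1600 / 1000 - 1) / 12       # total increase spread evenly over the years
print(round(r, 4), round(simple, 4))

# Compounding at the geometric rate recovers the final sum; the simple rate overshoots.
print(round(1000 * (1 + r) ** 12, 2), round(1000 * (1 + simple) ** 12, 2))
```

The rate found in this way is very nearly 4 per cent, whereas compounding at 5 per cent would carry the principal well beyond $1,600.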

THE GEOMETRIC MEAN AS A MEASURE OF CENTRAL
TENDENCY

A question arises as to the type of frequency distribution 
the central tendency of which would be best represented 
by the geometric mean. When the absolute measures, 
plotted on the arithmetic scale, give a fairly symmetrical 
distribution, the arithmetic mean is clearly preferable to the 
geometric mean. But when the absolute figures thus plotted 
give an asymmetrical frequency curve of such a type that 
the asymmetry would be removed and a symmetrical curve
secured by plotting the logarithms of the measures, the 
geometric mean would appear to be preferable. Such a 
distribution would be one in which not the absolute devia- 
tions about the central tendency but the relative deviations, 
the deviations as ratios, were symmetrical. The arithmetic 
mean of the logarithms of the various measures (which 
value is, as has been shown, the logarithm of the geometric 
mean of the original measures) would be the best representa- 
tive of the central tendency in such a distribution. The 
curve thus plotted would be symmetrical about the logarithm 
of the geometric mean. A frequency curve representing 
the logarithms of percentage changes in prices would tend 
to show this symmetry about the logarithm of the geometric 
mean of these changes. These percentage changes, as nat- 
ural numbers, group themselves in an asymmetrical form, 
with the range of deviations above the arithmetic mean 
greatly exceeding the range below.¹ This arises, of course,
from the fact that prices of given commodities may increase
1,000 per cent or more from a given base, but cannot fall
more than 100 per cent from any given base. The section
on index numbers contains a fuller discussion of this
particular phase of the subject.

The construction of a frequency distribution in which loga- 
rithms are tabulated would be laborious, if the logarithm 
of each item to be entered had to be determined, before 
tabulation. It is possible, however, with no great trouble to 
construct a true logarithmic distribution, with class-interval 
constant in terms of logarithms. The 66 quotations on pre- 
ferred stocks, tabulated in Table 29, range from 74 to 166. 
The logarithm of 74 is 1.86923; the logarithm of 166 is 
2.22011. The range, in logarithms, is .35088. We may 
select .06 as a suitable logarithmic class-interval, for the 
present purpose. For convenience in tabulating the data 
we set up two series of class limits, one in terms of logarithms, 
one in terms of the corresponding natural numbers. In 
constructing the distribution natural numbers may be
tabulated, utilizing the class limits defined in natural terms.
All subsequent calculations may be carried through in terms
of logarithms. The distribution appears in Table 30.

¹ C. M. Walsh, in The Problem of Estimation (London, P. S. King & Son,
1921), 35, lays down the following criteria for the use of averages:

(a) When there are no conceivable or assignable upper or lower limits to the
    values of the terms in a series, the arithmetic average should be
    employed.
(b) When there is a definite lower limit at or above zero and no upper
    conceivable or assignable limit, the geometric average should be employed.
    Because this is true of price changes Walsh believes the geometric
    average to be the correct one to use in making index numbers of prices.
(c) When in practice, or in the nature of things, certain upper and lower limits
    are found to exist and the above criteria cannot be employed, a study of
    the actual dispersion of the data is necessary. In this case, if the mode is
    found nearer to the arithmetic average, that average should be
    employed; if the mode is found nearer to the geometric average, that
    average should be used.

Table 30

Distribution of Prices of Preferred Stocks
Paying Seven Per Cent Dividends

Class-interval        Class-interval     Mid-point      Frequency
(natural numbers)     (logarithms)       (logarithms)
                                              m              f          fm

$ 70.80-$ 81.27       1.85-1.9099           1.88             2          3.76
  81.28-  93.32       1.91-1.9699           1.94             4          7.76
  93.33- 107.15       1.97-2.0299           2.00            12         24.00
 107.16- 123.02       2.03-2.0899           2.06            30         61.80
 123.03- 141.24       2.09-2.1499           2.12             6         12.72
 141.25- 162.17       2.15-2.2099           2.18             7         15.26
 162.18- 186.20       2.21-2.2699           2.24             5         11.20

                                                            66        136.50

If the geometric mean is considered appropriate for a given
series, the type of distribution represented by Table 30 is
more logical than that shown in Table 29, and the descriptive
measurements secured from Table 30 have correspondingly
greater validity. We may derive the mean of the
logarithms of the preferred stock prices by dividing Σfm
of Table 30 (136.50) by 66. The value is 2.06818. The
anti-log of this is 116.97, which is the geometric mean of
the distribution. This differs somewhat from the value
$115.63 secured from Table 29. The difference is due, in
part, to the use of different class-intervals and class limits
in the two cases. With a relatively small number of observations
such differences would be expected to lead to different
results. Differing assumptions concerning the internal
distribution of items within the several classes would also
contribute to a discrepancy between the two results. The
value obtained from Table 30 is probably a closer approximation
to the actual geometric mean than that obtained from
Table 29.
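The construction of the logarithmic class limits and the computation based on Table 30 may be sketched in Python as follows. The sketch assumes lower limits beginning at 1.85 with a constant interval of .06, as in the table; the variable names are illustrative. The natural-number limits it prints agree with those of Table 30 to within about a cent, and direct evaluation of the antilog gives a figure very close to 117, marginally above the printed value, a gap that presumably reflects rounding in the logarithm tables used.

```python
# Lower class limits in logarithms: 1.85, 1.91, ..., 2.21 (interval of .06).
log_lower = [round(1.85 + 0.06 * k, 2) for k in range(7)]
log_mid = [round(x + 0.03, 2) for x in log_lower]       # mid-points in logarithms
# Equivalent lower limits in dollars, used only for tallying the quotations.
natural_lower = [round(10 ** x, 2) for x in log_lower]

freqs = [2, 4, 12, 30, 6, 7, 5]                         # frequencies from Table 30

n = sum(freqs)
sum_fm = sum(f * m for f, m in zip(freqs, log_mid))
mean_log = sum_fm / n                                   # mean of the logarithms
mg = 10 ** mean_log                                     # geometric mean in dollars
print(natural_lower)
print(n, round(sum_fm, 2), round(mean_log, 5), round(mg, 2))
```

Since all later calculations run in logarithms, only the tallying of the original quotations requires the limits in natural numbers.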

A frequency curve based upon the logarithms of the 
measures included rather than upon the natural numbers, 
has been employed to advantage in plotting data relating 
to income distribution. When natural numbers are plotted, 
the range of income distribution is so large that it is physi- 
cally impossible to prepare a chart that will reveal the char- 
acteristic features of all sections of the curve. The process 
of plotting on double logarithmic paper (which is, of course, 
equivalent to plotting the logarithms of both x’s and y’s)
meets this difficulty, giving a true impression of the whole
distribution and the relations between its parts, and, at the
same time, brings out certain important features that are
obscured in the natural scale chart. In particular, this
device appears to smooth into a straight line that part of
the curve lying above the mode, a fact which led Vilfredo
Pareto to enunciate what has been known as Pareto’s Law
concerning income distribution. An intensive study of the
distribution of income in the United States has led the staff 
of the National Bureau of Economic Research to call into 
question certain conclusions drawn from Pareto’s generaliza- 
tions, though the value of the double logarithmic scale for 
the presentation of income data has been recognized. 

The Harmonic Mean 

The harmonic mean is a type of average capable of
application only within a restricted field, but which should
be employed to avoid error in handling certain types of
data. It must be used in the averaging of time rates and
it has distinctive advantages in the manipulation of some
types of price data. The following example will illustrate
the method of employing this average.

A given commodity is priced, in three different stores, at
“four for a dollar,” “five for a dollar” and “twenty for a
dollar.” The average price per unit is required. The
arithmetic average of the figures given (4, 5, and 20) is 9 2/3. If
we take this to be the average number sold per dollar, the
average price would appear to be $1.00 ÷ 9 2/3, or about 10 1/3 cents
each. But the original quotations are equivalent to unit
prices of 25 cents, 20 cents, and 5 cents; the arithmetic
average of these prices is 16 2/3 cents apiece. The discrepancy
between 10 1/3 cents and 16 2/3 cents is due to a faulty use
of the arithmetic mean in averaging quotations in the “so
many per dollar” form. Such a mean is, in effect, a weighted
average, with greater weight being given to quotations
involving a larger number of commodity units.

The correct result may be secured by taking the harmonic
mean of the three original quotations. The harmonic mean
of a series of numbers is the reciprocal of the arithmetic mean
of the reciprocals of the individual numbers. Thus if we
represent the numbers to be averaged by $r_1, r_2, \ldots, r_n$, the formula
for the harmonic mean, H, is

$$\frac{1}{H} = \frac{\frac{1}{r_1} + \frac{1}{r_2} + \frac{1}{r_3} + \cdots + \frac{1}{r_n}}{n}.$$

Using the figures just quoted:

$$\frac{1}{H} = \frac{\frac{1}{4} + \frac{1}{5} + \frac{1}{20}}{3} = \frac{10}{60} = \frac{1}{6}$$

$$H = 6.$$

The harmonic mean of 4, 5, and 20 is 6, the average number
of units sold per dollar. The average price per unit is
16 2/3 cents.
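The example may be reproduced exactly with rational arithmetic. The Python sketch below is illustrative; the function name is an assumption, and the quotations are taken to be whole numbers of units per dollar.

```python
from fractions import Fraction

def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    reciprocals = [Fraction(1, v) for v in values]
    return len(values) / sum(reciprocals)

per_dollar = [4, 5, 20]                      # "so many per dollar" quotations
h = harmonic_mean(per_dollar)                # average units per dollar
price_cents = Fraction(100) / h              # average price per unit, in cents

# The faulty figure obtained by averaging the quotations arithmetically.
naive_cents = Fraction(100) / Fraction(sum(per_dollar), len(per_dollar))
print(h, price_cents, naive_cents)
```

The harmonic mean gives 50/3, or 16 2/3 cents per unit, which is also the arithmetic mean of the unit prices; the faulty method gives 300/29, or roughly 10 1/3 cents.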

The computation of the harmonic mean of a series of 
magnitudes is greatly facilitated by the use of prepared 
tables of reciprocals.* 


* Barlow’s Tables of Squares, Cubes, Square Roots, Cube Roots and Reciprocals,
New York, Spon and Chamberlain.

Relations between Different Averages

When different averages are located or computed for a
given series of magnitudes, certain relationships between
them are found to prevail.

1. The arithmetic mean, the median and the mode coincide in
   a symmetrical distribution.

2. In a moderately asymmetrical distribution the median lies
   between the mean and the mode, approximately one third
   of the distance along the scale from the former towards the
   latter. Hence, for this type of distribution there is an
   approximation to the following relationship:

   $$\text{Mode} = \text{Mean} - 3(\text{Mean} - \text{Median}).$$

3. The arithmetic mean of any series of magnitudes is greater
   than their geometric mean.

4. The geometric mean of any series of magnitudes is greater
   than their harmonic mean. The only exception to the last
   two rules is found when all the measures in the series are
   equal, in which case arithmetic mean, geometric mean and
   harmonic mean are equal.

5. The geometric mean of any two terms is equal to the geometric
   mean of the harmonic and arithmetic means of those terms.
   Thus if the terms be 2 and 8, the harmonic mean is 3 1/5, the
   geometric mean 4, and the arithmetic mean 5. But 4 is also the
   geometric mean of 3 1/5 and 5. This relationship does not
   hold when the series includes more than two terms, unless
   the terms constitute a geometric series.

6. When the dispersion of data follows the arithmetic law, the
   mode and median will generally be found closer to the
   arithmetic than to the geometric average. When the
   dispersion follows the geometric law the mode and median will
   generally be found closer to the geometric than to the
   arithmetic average.
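Relations 3, 4 and 5 may be checked numerically with the standard `statistics` module of Python (version 3.8 or later). The series 2, 4, 9 is a hypothetical example introduced here only to show a case in which relation 5 fails.

```python
import math
from statistics import mean, geometric_mean, harmonic_mean

def three_means(series):
    return mean(series), geometric_mean(series), harmonic_mean(series)

# Two terms: 2 and 8.
am, gm, hm = three_means([2, 8])
print(am, round(gm, 6), round(hm, 6))
print(round(math.sqrt(am * hm), 6))      # geometric mean of the other two means

# Three terms in geometric series (2, 4, 8) and three that are not (2, 4, 9).
am_g, gm_g, hm_g = three_means([2, 4, 8])
am_x, gm_x, hm_x = three_means([2, 4, 9])
print(round(gm_g - math.sqrt(am_g * hm_g), 6))   # practically zero
print(round(gm_x - math.sqrt(am_x * hm_x), 6))   # not zero
```

In every case the arithmetic mean exceeds the geometric and the geometric exceeds the harmonic, while the equality of relation 5 holds for the pair and for the geometric series but not for the third series.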

Characteristic Features of the Chief Averages

The arithmetic mean

1. The value of the arithmetic mean is affected by every measure
   in the series. For certain purposes it is too much affected by
   extreme deviations from the average.

2. The arithmetic mean is easily calculated, and is determinate
   in every case.

3. The arithmetic mean is a computed average, and hence is
   capable of algebraic manipulation.

The median

1. The value of the median is not affected by the magnitude of
   extreme deviations from the average.

2. The median may be located when the items in a series are not
   capable of quantitative measurement.

3. The median may be located when the data are incomplete,
   provided that the number and general location of all the
   cases be known, and that accurate information be available
   concerning the measures near the center of the distribution.

4. The median is not as well adapted to algebraic manipulation
   as the arithmetic, geometric and harmonic means.

The mode

1. The value of the mode is not affected by the magnitude of
   extreme deviations from the average.

2. The approximate mode is easy to locate but the determination
   of the true mode requires extended calculation.

3. The mode has no significance unless the distribution includes a
   large number of measures and possesses a distinct central
   tendency.

4. The mode is the average most typical of the distribution,
   being located at the point of greatest concentration.

5. The mode is not capable of algebraic manipulation.

The geometric mean

1. The geometric mean gives less weight to extreme deviations
   than does the arithmetic mean.

2. It is strictly determinate in averaging positive values.

3. The geometric mean is the form of average to be used when rates
   of change or ratios between measures are to be averaged,
   as equal weight is given to equal ratios of change. It is
   particularly well adapted to the averaging of ratios of price
   change.

4. The geometric mean is capable of algebraic manipulation.

The harmonic mean

1. The harmonic mean is adapted to the averaging of time rates
   and certain similar terms. It has been employed in the
   field of economic statistics in the manipulation of price data.

2. The labor of computing the harmonic mean and its unfamiliarity
   detract from its usefulness in ordinary statistical analysis.

3. The harmonic mean is capable of algebraic manipulation.

This summary has been designed to show that each 
type of average has its own particular field of usefulness. 
Each one is best for certain purposes and under certain 
conditions. The characteristics and limitations of each one 
should be understood in order that it may be appropriately 
employed. A complete description of a frequency distribution
frequently calls for the determination of two or three
of the chief averages, as well as other statistical
measurements. The arithmetic mean is perhaps the most useful
single average. The simplicity of its computation, the
possibility of employing it in algebraic calculations and
the fact that its meaning is perfectly definite and familiar
make it highly serviceable in statistical work. Its sphere
of usefulness is not universal, however, and it should only
be employed when the given conditions render it suitable.
A fuller appreciation of the distinctive virtues of the
geometric mean is leading to a wider employment of that
measure in many types of statistical work. A discriminating
use of averages is essential to sound statistical analysis.

CHAPTER V


DESCRIPTION OF THE FREQUENCY
DISTRIBUTION: MEASURES OF
VARIATION AND SKEWNESS

In the preceding chapters we have been concerned, first, 
with methods of reducing a mass of quantitative data to a 
form in which the characteristics of the mass as a whole 
may be readily determined and, in the second place, with 
methods of describing the assembled data. The first ob- 
ject is accomplished with the formation of a frequency 
distribution. The second is partially accomplished when 
there has been obtained a single significant value in the 
form of an average which represents the central tendency 
of the distribution. But any average, by itself, fails to give 
a complete description of a frequency distribution. Three 
other values are needed before the chief characteristics of a 
given distribution have been measured, and comparison 
with other distributions is possible. The first of these is a 
measure of the degree to which the items included in the 
original distribution depart or vary from the central value, 
the degree of “scatter,” variation or dispersion. The second
is a measure of the degree of symmetry of the distribution, 
of the balance or lack of balance on the two sides of the 
central value. The third is a measure of kurtosis, of the de- 
gree to which there is a bunching of cases at the modal 
value. The present chapter deals with various measures of 
variation and skewness. The method of measuring kurtosis 
is referred to at a later point. 

Nature and Significance of Variation

The fact of variation in collections of quantitative data 
has been pointed out in earlier sections and the bearing of
this fact upon the work of the statistician indicated.
Practically every collection of quantitative data, consisting of
measurements from the social, biological, or economic field, 
is characterized by variation, by quantitative differences 
among the individual units. And this fact of variation is 
as important as the fact of family resemblance. Biological 
variation has been a fundamental factor in the evolutionary 
process. No measurement of a physical characteristic of a 
racial group, such as height, is complete without an ac- 
companying measure of the average variation in the group 
in this respect. The average income in a country is perhaps 
of less significance than the variation in income, the differ- 
ences between the incomes received by different economic 
classes. Price variations interrupt the normal functioning 
of the economic system, causing hardship to some and 
giving unearned profits to others, because the various ele- 
ments in the price system are unequally affected. Not 
changes in the general level of prices but differences among 
changes in the prices of individual commodities and services 
cause trouble. 

An average, by itself, has little significance unless the 
degree of variation in the given frequency distribution is 
known. If the variation is so great that there is no
pronounced central tendency an average has no significance.
With a decrease in the degree of variation an average
becomes increasingly significant. Whether a single fre- 
quency distribution is being described, therefore, or com- 
parison is being made with other distributions, a measure of 
central tendency must be supplemented by a measure of 
variation. 

Measures of Absolute Variation

Variation may be expressed in terms of the units of 
measurement employed for the original data, or may be 
expressed as an abstract figure, such as a percentage, which 
is independent of the original units. When the original
units are employed absolute variability is measured; when
an abstract figure is secured we have a measure of relative 
variability, more suitable for comparison than the former 
type. Measures of absolute variability are first considered. 

THE RANGE

A rough measure of variation is afforded by the range, 
which is the absolute difference between the value of the 
smallest item and the value of the greatest item included 
in the distribution. Table 20 in Chapter IV shows the
distribution of London-New York monthly exchange rates
during the period 1882-1913. The smallest item among the
original figures included in the table is $4.83; the greatest
is $4.908. The range, therefore, is $4.908 - $4.83, or $.078.
A distance on the scale equal to $.078 will include every 
item. If the original data were not to be had the range 
could be approximated from the frequency table. It would 
be the difference between the lower limit of the class at the 
lower extreme of the distribution, and the upper limit of 
the class at the upper extreme, or $.085 in the present 
case. 

The value of the range, it is obvious, depends upon the 
values of the two extreme cases only. A single abnormal 
item would change its value materially. Because it is 
erratic and is likely to be unrepresentative of the true 
distribution of items, it is seldom used in statistical work. 
The range is frequently employed as a measure of stock 
market fluctuations, though its adequacy for this purpose
may be questioned. 

THE MEAN DEVIATION 

A more accurate measure of the dispersion of items about 
a central value is afforded by the simple device of measuring 
the deviation of each item from this central value and
averaging these deviations. The simple example in Table 31
illustrates the method of computation:

m

3
6
9
12
15

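The average (the mean and median coincide in this case)
is 9. The deviations are added, taking no account of
algebraic signs, and the total divided by the number of items.
This procedure is described by the expression

$$\text{M.D.} = \frac{\Sigma |d|}{N},$$

where d denotes the deviation of an item from the average
and N the number of items.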

In general terms, the mean deviation of a series of mag- 
nitudes is the arithmetic mean of their deviations from an 
average value (either mean or median). In the process of 
summation and averaging the algebraic signs of the devi- 
ations are disregarded. In practice it makes little differ- 
ence whether deviations be measured from the mean or 
the median. Theoretically the latter should be chosen, for 
the value of the mean deviation is least when the median 
is the point of reference. 
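The computation for the five items of the simple example, and the statement that the mean deviation is least when measured from the median, may be illustrated in Python. The second series is a hypothetical skewed example introduced here for the comparison; the function name is likewise an assumption.

```python
from statistics import mean, median

def mean_deviation(values, center):
    """Arithmetic mean of the deviations from `center`, signs disregarded."""
    return sum(abs(v - center) for v in values) / len(values)

items = [3, 6, 9, 12, 15]
md = mean_deviation(items, median(items))        # deviations 6, 3, 0, 3, 6
print(md)

skewed = [1, 2, 3, 4, 20]                        # hypothetical series; mean 6, median 3
about_median = mean_deviation(skewed, median(skewed))
about_mean = mean_deviation(skewed, mean(skewed))
print(about_median, about_mean)
```

For the symmetrical series the two points of reference coincide; for the skewed series the figure measured from the median is the smaller of the two.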

Table 32 illustrates the computation of the mean devi- 
ation when the data are grouped in a frequency distribu- 
tion.* In this work, as in certain other computations, we 
make the assumption that the items in each class-interval 
are uniformly distributed throughout that interval. 

The median hourly wage of the 4,216 steel workers represented
in this distribution is 48.11 cents. The mean deviation
could be computed directly, with reference to deviations
from the median, but it is simpler to measure the deviations
from the midpoint of the class containing the median, and
then apply corrections to offset the resulting error.

* Since the uses of the mean deviation are somewhat limited, the beginning
student may well omit the remainder of the section on the mean deviation.
After study of the more widely employed standard deviation the student may
wish to return to the computation of the mean deviation of observations
grouped in a frequency distribution.

Table 32

Computation of Mean Deviation
Average Hourly Earnings of Workers in Open-Hearth Furnaces
in the Great Lakes and Middle West District in 1933

Class-interval       Mid-point    Frequency    Deviation from
(in cents per                                  arbitrary origin
hour)                    m            f              d'              fd'

 25.0- 29.9             27.5           12            20              240
 30.0- 34.9             32.5          472            15            7,080
 35.0- 39.9             37.5          700            10            7,000
 40.0- 44.9             42.5          601             5            3,005
 45.0- 49.9             47.5          520             0                0
 50.0- 54.9             52.5          537             5            2,685
 55.0- 59.9             57.5          397            10            3,970
 60.0- 64.9             62.5          225            15            3,375
 65.0- 69.9             67.5          139            20            2,780
 70.0- 74.9             72.5          111            25            2,775
 75.0- 79.9             77.5           43            30            1,290
 80.0- 84.9             82.5          111            35            3,885
 85.0- 89.9             87.5           74            40            2,960
 90.0- 94.9             92.5           59            45            2,655
 95.0- 99.9             97.5           45            50            2,250
100.0-104.9            102.5           51            55            2,805
105.0-109.9            107.5           78            60            4,680
110.0-114.9            112.5            6            65              390
115.0-119.9            117.5           17            70            1,190
120.0-124.9            122.5            1            75               75
125.0-129.9            127.5            2            80              160
130.0-134.9            132.5            5            85              425
135.0-139.9            137.5            7            90              630
140.0-144.9            142.5            1            95               95
145.0-149.9            147.5            1           100              100
150.0-154.9            152.5            0           105                0
155.0-159.9            157.5            1           110              110

                                    4,216                         56,610

Median = 48.11
Arbitrary origin = 47.5
c = 48.11 - 47.5 = 0.61
$N_a$ = No. of observations in classes above that containing the median = 1,911
$N_b$ = No. of observations in classes below that containing the median = 1,785
$N_m$ = No. of observations in the class-interval containing the median = 520
i = 5

Computation of median:

$$Md = 45.0 + \frac{2{,}108 - 1{,}785}{520} \times 5.0 = 45.0 + 3.11 = 48.11$$

Calculations:

(1) Sum of deviations from arbitrary origin of all observations in classes
    other than that containing the median = 56,610

(2) $(N_b - N_a)\,c = -76.86$

(3) $N_m \cdot \dfrac{(i/2)^2 + c^2}{i} = 688.67$

Sum of deviations from median = 56,610 - 76.86 + 688.67 = 57,221.81

$$\text{M.D.} = \frac{57{,}221.81}{4{,}216} = 13.573$$
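The whole computation of Table 32 may be sketched in Python. This is an illustration under the assumption, made in the text, that items are uniformly distributed within each class-interval; the variable names are illustrative. The median is carried without rounding, so the intermediate corrections differ slightly from those obtained with c rounded to 0.61, though the final result agrees to three decimals.

```python
width = 5.0
lower = [25.0 + width * k for k in range(27)]     # lower class limits, cents per hour
freqs = [12, 472, 700, 601, 520, 537, 397, 225, 139, 111, 43, 111, 74, 59,
         45, 51, 78, 6, 17, 1, 2, 5, 7, 1, 1, 0, 1]
n = sum(freqs)

# Find the class containing the median and interpolate within it.
n_below = 0
for k, f in enumerate(freqs):
    if n_below + f >= n / 2:
        break
    n_below += f
n_med = freqs[k]
n_above = n - n_below - n_med
median = lower[k] + (n / 2 - n_below) / n_med * width

origin = lower[k] + width / 2                     # mid-point of the median class
c = median - origin

# (1) Deviations from the arbitrary origin, median class left out.
step1 = sum(f * abs(lo + width / 2 - origin)
            for j, (lo, f) in enumerate(zip(lower, freqs)) if j != k)
# (2) Correction for measuring from the origin instead of the median.
step2 = (n_below - n_above) * c
# (3) Deviations of the items inside the median class, spread uniformly.
step3 = n_med * ((width / 2) ** 2 + c ** 2) / width

mean_dev = (step1 + step2 + step3) / n
print(n, round(median, 2), round(step1), round(mean_dev, 3))
```

Step (3) follows from summing the deviations of a uniform spread of items on either side of the median within the class-interval.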

In Table 32 deviations have been measured not from
48.11, the value of the median, but from 47.5, the midpoint
of the class in which the median falls. Working with these
measurements, the computations involve three steps:

1. Obtaining the sum of the deviations from the assumed
median of all items falling in classes other than that
containing the true median.

2. Correcting this sum for the error involved in the use of
an origin other than the true median.

3. Adding to the corrected sum the sum of the deviations
from the median of the items within the class-interval
containing the median.

(1) The sum referred to in (1) is obtained directly, in the
manner indicated in Table 32.* It comes to 56,610.

(2) The four classes below that containing the median
contain 1,785 items. The deviation of each of these items from the
true median, 48.11, is greater by 0.61 than the deviations
actually recorded in Table 32, which are measured from 47.5. The
measured deviations are too small by 0.61 for 1,785 items. The
22 classes above that containing the median contain 1,911 items.
For each of these the deviation from the true median, 48.11, is less
by 0.61 than the deviations actually recorded, which are
measured from 47.5. The measured deviations are too large by
0.61 for 1,911 items. Accordingly the figure 56,610, which we
have obtained as the sum of the deviations from the arbitrary
origin of all the items in classes other than that containing the
true median, must be corrected by the addition to it,
algebraically, of + (1,785 X 0.61) and - (1,911 X 0.61).

* This is not the sum of the deviations from 47.5, the arbitrary origin. For
no account is taken of the deviations from that value of the 520 items falling
within the class in question. If these are scattered uniformly throughout the
class-interval they will contribute to the total of the deviations from 47.5.
This would not be so if we were working on the assumption that all the items
in a class are concentrated at the midpoint. In computing the mean deviation,
however, it is necessary to make a different assumption, namely, that of
uniform distribution throughout the class-interval.
The corrections under point (2) may be defined more briefly.*
Let $N_a$ = number of items in classes above that containing the
median, $N_b$ = number of items in classes below that containing
the median, and $c = Md - O$, where $Md$ is the median and
$O$ is the arbitrary origin. The quantity $c$ will, of course, be
positive or negative, depending on the relative values of $Md$
and $O$, and this sign should be retained throughout the
calculations. The correction noted in (2) is then given by

$$(N_b - N_a)\,c$$

which is to be added (algebraically) to the sum referred to in
(1). In the present instance we have, as the required correction,


(1,785 - 1,911) × (+ 0.61) = - 76.86.


(3) Taking account of point (3) now, we must measure the
deviations from the median of the 520 observations hitherto
neglected. These are the observations falling within the
class-interval that contains the median. This class-interval extends,
on the x-scale, from 45.0 to 50.0. The value of the median is
48.11. If the 520 observations are uniformly distributed
between 45.0 and 50.0, the number falling between 45.0 and 48.11
may be computed by the direct proportion

$$\frac{3.11}{5.0} \times 520 = 323.4.$$

Similarly, for the number of observations between 48.11 and
50.0, we have

$$\frac{1.89}{5.0} \times 520 = 196.6.$$

On the assumption of uniform distribution, the average deviation
from the median of the 323.4 observations falling between 45.0
and 48.11 is 1.555 (i.e., 3.11/2). For the sum of the deviations
of this group from the median, we have

323.4 × 1.555 = 502.887.

Similarly, the average deviation from the median of the 196.6
observations falling between 48.11 and 50.0 is .945 (i.e., 1.89/2).

* Cf. Handbook of Mathematical Statistics, H. L. Rietz, editor, Boston,
Houghton Mifflin, 1924.
For the sum of the deviations of this group from the median
we have

(196.6) X .945 = 185.787. 

The sum of the deviations from the median of all the observa- 
tions in the class containing the median is 

502.887 + 185.787 = 688.674. 

In more general terms, the correction noted in (3) may be
defined as follows. We have $c = Md - O$, where $O$ is the
midpoint of the class containing the median; let $i$ = class-interval
and let $N_m$ = number of observations in the class-interval
in which the median lies. The sign of $c$ must be retained in the
calculations. For the number of items in that portion of this
class-interval which falls below the median, we have

$$N_m \cdot \frac{\frac{i}{2} + c}{i}.$$

The average deviation of these items from the median is

$$\frac{\frac{i}{2} + c}{2}.$$

The sum of the deviations from the median of the items in this
segment of the class-interval containing the median is the
product of these two quantities, or

$$\frac{N_m \left(\frac{i}{2} + c\right)^2}{2i}.$$

For the number of items in that portion of this class-interval
which lies above the median, we have

$$N_m \cdot \frac{\frac{i}{2} - c}{i}.$$

The average deviation of these items from the median is

$$\frac{\frac{i}{2} - c}{2}.$$

The sum of the deviations from the median of the items in this
segment of the class-interval containing the median is given by

$$\frac{N_m \left(\frac{i}{2} - c\right)^2}{2i}.$$

Accordingly, the total correction referred to under (3) above,
or the sum of the deviations from the median of the items within
the class-interval containing the median, is

$$\frac{N_m}{2i}\left[\left(\frac{i}{2} + c\right)^2 + \left(\frac{i}{2} - c\right)^2\right] = N_m\left(\frac{i}{4} + \frac{c^2}{i}\right).$$

The nature of these formulas may be made clearer by insertion 
of the values in the example cited above. 


In the final computation of the mean deviation we must
apply to the sum referred to under (1) the two corrections
noted under (2) and (3). From (1) we have
56,610; the correction under (2) is - 76.86; the correction
under (3) is + 688.67. The sum of the deviations from the
median is, therefore, 57,221.81. For the mean deviation from
the median, we have

$$M.D. = \frac{57,221.81}{4,216} = 13.573.$$


The mean deviation from the mean may be computed by 
an identical process. 
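
The three steps may be sketched in a few lines of Python. This is an illustration only, resting on the assumptions already stated in the text: the arbitrary origin is the midpoint of the class containing the median, and the items of that class are spread uniformly over it. The function name and its arguments are our own labels, not part of the original procedure. Because the closed form for correction (3) carries no intermediate rounding, it gives about 688.70 rather than the 688.67 obtained above from the rounded counts 323.4 and 196.6; the mean deviation is unaffected to three decimals.

```python
def mean_deviation_from_median(sum_outside, n_below, n_above,
                               n_median_class, c, interval, n_total):
    """Mean deviation from the median for a grouped distribution.

    sum_outside    -- sum of deviations, measured from the midpoint of the
                      class containing the median, of all items lying in
                      the other classes (step 1)
    c              -- median minus that midpoint, sign retained
    interval       -- width of the class-interval
    """
    # Step 2: shift the origin from the class midpoint to the true median.
    correction_2 = (n_below - n_above) * c
    # Step 3: items inside the median class, assumed uniformly distributed.
    correction_3 = n_median_class * (interval / 4 + c ** 2 / interval)
    total = sum_outside + correction_2 + correction_3
    return correction_2, correction_3, total / n_total


# Figures quoted in the text for Table 32.
corr2, corr3, md = mean_deviation_from_median(
    sum_outside=56_610, n_below=1_785, n_above=1_911,
    n_median_class=520, c=0.61, interval=5.0, n_total=4_216)

print(round(corr2, 2), round(corr3, 2), round(md, 3))
```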

THE STANDARD DEVIATION 

The process of calculating the mean deviation is alge- 
braically illogical because algebraic signs are disregarded. 
In the computation of the standard deviation this error is 
avoided and a measure of more precise mathematical
significance is secured. The conventional symbol for the
standard deviation is the Greek letter sigma, $\sigma$.

In computing this measure the deviations of the individual
items from the arithmetic mean are squared, totaled,
the mean of the squared deviations obtained, and the square
root of this mean extracted. The standard deviation is,
thus, the square root of the mean of the squared deviations.
This measure is also termed the root-mean-square deviation, 
a useful name because it describes in full the method of 
calculation. The deviations are always measured from the 
arithmetic mean, as the value of the measure is a minimum 
under these conditions. A simple example will illustrate the 
process (Table 33). 


Table 33

Computation of Standard Deviation

 m     f     d

 3     1    -6
 6     1    -3
 9     1     0
12     1    +3
15     1    +6


When the standard deviation is computed from ungrouped
data, as in this case, the formula * is

$$\sigma = \sqrt{\frac{\Sigma d^2}{N}}.$$

When the items are grouped in a frequency distribution 
the task of computation is a little more complicated. The 
measurement of deviations from an arbitrary origin is essen- 
tial in this case, as it greatly simplifies the calculations. 

* This formula is used in statistical description, which is the concern of this
section of the book. If our purpose is to use results secured from a sample as
estimates of the attributes of the population from which the sample has been
drawn, a slight modification is desirable. It has been shown that the estimate
of the true standard deviation is improved if N - 1 be used as the divisor in the
formula, in place of N. The difference is slight for estimates based on large
samples, important for small ones.
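
The ungrouped computation for the five items of Table 33, together with the N - 1 modification mentioned in the note, may be sketched as follows. The variable names are ours; the comparison with Python's `statistics.pstdev` (divisor N) and `statistics.stdev` (divisor N - 1) is added only as a cross-check.

```python
import math
import statistics

values = [3, 6, 9, 12, 15]                 # the five items of Table 33
n = len(values)
mean = sum(values) / n                     # arithmetic mean, 9
squared_devs = [(x - mean) ** 2 for x in values]

# Descriptive formula: divisor N.
sigma_descriptive = math.sqrt(sum(squared_devs) / n)
# Estimate of the population value from a sample: divisor N - 1.
sigma_estimate = math.sqrt(sum(squared_devs) / (n - 1))

# The standard library provides the same two quantities.
assert math.isclose(sigma_descriptive, statistics.pstdev(values))
assert math.isclose(sigma_estimate, statistics.stdev(values))

print(round(sigma_descriptive, 3), round(sigma_estimate, 3))
```

With only five items the two divisors give visibly different results, which illustrates the remark that the difference matters for small samples.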
The general formula for the standard deviation is

$$\sigma = \sqrt{\frac{\Sigma f d^2}{N}}$$

where $f$ represents the class-frequencies, $d$ the deviations
from the arithmetic mean and $N$ the number of cases
included. It follows that

$$\sigma^2 = \frac{\Sigma f d^2}{N}.$$

If a deviation from an arbitrary origin be represented by
$d'$ and the root-mean-square deviation from this origin be
represented by $s_a$, we have

$$s_a^2 = \frac{\Sigma f (d')^2}{N}.$$

The root-mean-square deviation from the mean ($\sigma$) is less
than the root-mean-square deviation from any other point
on the scale. Hence $s_a^2$ is greater than $\sigma^2$. We may
represent by $c$ the difference between the true mean and the
arbitrary origin. It may be readily established * that

$$\sigma^2 = s_a^2 - c^2.$$

The value of the standard deviation may be most easily
determined, therefore, by computing $s_a^2$ and $c^2$. The
operations involved are illustrated in detail in Table 34, showing
the distribution of 11,404 steel workers, classified on the
basis of average hourly earnings in 1933.



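* $d' = d + c$

$(d')^2 = d^2 + 2cd + c^2$

$\Sigma (d')^2 = \Sigma d^2 + 2c\,\Sigma d + Nc^2$

but $\Sigma d = 0$

$\therefore \Sigma (d')^2 = \Sigma d^2 + Nc^2$

$\frac{\Sigma (d')^2}{N} = \frac{\Sigma d^2}{N} + c^2$

$s_a^2 = \sigma^2 + c^2$

$\sigma^2 = s_a^2 - c^2.$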


Table 34. Computation of Standard Deviation
Average Hourly Earnings of Workers in Open-Hearth Furnaces in 1933

    (1)         (2)      (3)       (4)         (5)        (6)        (7)        (8)
  Class-       Mid-     Fre-    Deviation
 interval      point   quency     from
(cents per    (cents)           arbitrary
   hour)                          origin
                 m        f         d'         fd'      f(d')^2   (d'+1)^2  f(d'+1)^2

 15.0- 19.9     17.5       41      - 9       -   369      3,321       64       2,624
 20.0- 24.9     22.5       54      - 8       -   432      3,456       49       2,646
 25.0- 29.9     27.5      342      - 7       - 2,394     16,758       36      12,312
 30.0- 34.9     32.5    1,158      - 6       - 6,948     41,688       25      28,950
 35.0- 39.9     37.5    2,103      - 5       -10,515     52,575       16      33,648
 40.0- 44.9     42.5    2,063      - 4       - 8,252     33,008        9      18,567
 45.0- 49.9     47.5    1,433      - 3       - 4,299     12,897        4       5,732
 50.0- 54.9     52.5    1,131      - 2       - 2,262      4,524        1       1,131
 55.0- 59.9     57.5      775      - 1       -   775        775        0           0
 60.0- 64.9     62.5      478        0             0          0        1         478
 65.0- 69.9     67.5      457        1           457        457        4       1,828
 70.0- 74.9     72.5      304        2           608      1,216        9       2,736
 75.0- 79.9     77.5      216        3           648      1,944       16       3,456
 80.0- 84.9     82.5      193        4           772      3,088       25       4,825
 85.0- 89.9     87.5      117        5           585      2,925       36       4,212
 90.0- 94.9     92.5      111        6           666      3,996       49       5,439
 95.0- 99.9     97.5       62        7           434      3,038       64       3,968
100.0-104.9    102.5       71        8           568      4,544       81       5,751
105.0-109.9    107.5      103        9           927      8,343      100      10,300
110.0-114.9    112.5       34       10           340      3,400      121       4,114
115.0-119.9    117.5       58       11           638      7,018      144       8,352
120.0-124.9    122.5       27       12           324      3,888      169       4,563
125.0-129.9    127.5       19       13           247      3,211      196       3,724
130.0-134.9    132.5       19       14           266      3,724      225       4,275
135.0-139.9    137.5       14       15           210      3,150      256       3,584
140.0-144.9    142.5       12       16           192      3,072      289       3,468
145.0-149.9    147.5        2       17            34        578      324         648
150.0-154.9    152.5        4       18            72      1,296      361       1,444
155.0-159.9    157.5        2       19            38        722      400         800
160.0-164.9    162.5        1       20            20        400      441         441

Total                  11,404                -28,200    229,012              184,016


$N$ = 11,404

Class-interval = 5.0 cents

$c$ (in class-interval units) $= \frac{-28,200}{11,404} = -2.4728$

$c^2$ (in class-interval units) $= +6.1147$

$s_a^2$ (in class-interval units) $= \frac{\Sigma f (d')^2}{N} = \frac{229,012}{11,404} = 20.0817$

$\sigma^2$ (in class-interval units) $= s_a^2 - c^2 = 20.0817 - 6.1147 = 13.9670$

$\sigma$ (in class-interval units) $= 3.737$

$\sigma$ (in original units) $= 3.737 \times 5.0$ cents $= 18.685$ cents.
The entire calculation, it will be noted, is carried through
in terms of class-interval units, the result being reduced to 
the original units in the final operation. In computing c, 
the difference between the true mean and the arbitrary 
origin, the algebraic sum of the deviations is divided by the 
number of cases. The arithmetic mean could be deter- 
mined by reducing c to original units and adding this value 
(algebraically) to the value of the arbitrary quantity se- 
lected as origin, but this is not an essential step. The actual 
value of the mean need not be known in the computation of 
the standard deviation. 

A check upon the accuracy of the calculations (the Charlier
check *) is afforded by the figures in cols. (7) and (8) of
Table 34. If deviations be measured, not from the arbitrary
origin employed in computing the standard deviation, but
from an origin one class-interval below, we secure a set of
values equal to $d' + 1$. The squares of these values are
given in col. (7). Multiplying by the corresponding
frequencies we have the quantities recorded in col. (8), the
sum of which is 184,016. This total stands in a definite
relationship to the values secured in computing the standard
deviation. For

$$\Sigma f(d' + 1)^2 = \Sigma f\left[(d')^2 + 2d' + 1\right] = \Sigma f(d')^2 + 2\Sigma fd' + \Sigma f$$

or

$$\Sigma f(d' + 1)^2 = \Sigma f(d')^2 + 2\Sigma fd' + N.$$

Inserting in this last equation the values secured from
the calculations shown in Table 34, we obtain this check:

184,016 = 229,012 + 2(- 28,200) + 11,404
        = 184,016.
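
As a small illustration, the check may be written as a function of the four column totals. The function name is ours, and the "slip" below is a hypothetical copying error introduced only to show the check failing. The check catches an error in any one total; it cannot catch errors that happen to offset one another.

```python
def charlier_check(n, sum_fd, sum_fd_sq, sum_fd_plus_one_sq):
    """Return True when the column totals satisfy the Charlier identity:

    sum f(d' + 1)^2 = sum f(d')^2 + 2 * sum f d' + N
    """
    return sum_fd_plus_one_sq == sum_fd_sq + 2 * sum_fd + n


# Totals taken from Table 34.
table_34_ok = charlier_check(11_404, -28_200, 229_012, 184_016)

# Hypothetical slip: the total of col. (6) miscopied as 229,102.
slip_detected = not charlier_check(11_404, -28_200, 229_102, 184_016)

print(table_34_ok, slip_detected)
```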

The following is a summary of the steps in the process of 
computing the standard deviation of items grouped in a 
frequency distribution: 

* Cf. C. V. L. Charlier, Vorlesungen über die Grundzüge der mathematischen
Statistik, Lund, Verlag Scientia, 1920, 19.
1. Select as arbitrary origin the mid-point of a class near the center
   of the distribution.

2. Measure the deviations from this point of the items in each class,
   in class-interval units. Multiply the deviations by the
   corresponding class-frequencies.

3. Divide the algebraic sum of these products by $N$. This gives $c$,
   in class-interval units. Compute $c^2$.

4. Square the deviations and multiply by the corresponding
   class-frequencies.

5. Divide the sum of the squared deviations by $N$. This gives $s_a^2$,
   in class-interval units.

6. From the formula, $\sigma^2 = s_a^2 - c^2$, compute $\sigma^2$. Extract the
   square root of this value, securing $\sigma$ in class-interval units.

7. Multiply $\sigma$, as thus computed, by the class-interval. The result
   is $\sigma$ in the original units of measurement.
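
The seven steps above can be sketched in Python, using the class-frequencies of Table 34. The variable names are ours. Carrying full precision through the last step gives about 18.69 cents; the figure 18.685 in Table 34 results from rounding $\sigma$ to 3.737 class-interval units before multiplying by 5.0.

```python
import math

# Class-frequencies of Table 34, classes 15.0-19.9 up to 160.0-164.9 cents.
freqs = [41, 54, 342, 1158, 2103, 2063, 1433, 1131, 775, 478,
         457, 304, 216, 193, 117, 111, 62, 71, 103, 34,
         58, 27, 19, 19, 14, 12, 2, 4, 2, 1]
interval = 5.0                 # class-interval, in cents
origin = 62.5                  # step 1: mid-point taken as arbitrary origin
devs = list(range(-9, 21))     # step 2: deviations d' in class-interval units

n = sum(freqs)
sum_fd = sum(f * d for f, d in zip(freqs, devs))        # step 2: sum of f d'
sum_fd_sq = sum(f * d * d for f, d in zip(freqs, devs))  # step 4: sum of f d'^2

c = sum_fd / n                              # step 3
s_a_sq = sum_fd_sq / n                      # step 5
sigma_units = math.sqrt(s_a_sq - c ** 2)    # step 6
sigma_cents = sigma_units * interval        # step 7

# Not required for sigma, but available at no extra cost: the mean.
mean_cents = origin + c * interval

print(n, sum_fd, sum_fd_sq)
print(round(c, 4), round(sigma_units, 3),
      round(sigma_cents, 2), round(mean_cents, 3))
```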

Certain of the characteristics of the standard deviation 
and its relation to other measures of dispersion are described 
in a later section.*

THE QUARTILE DEVIATION 

In the chapter on averages methods of locating the quar- 
tiles and deciles were described. The former are those points 
on the scale of values, along which the items of a given 
distribution lie, which divide the total number of items into 
four equal groups. The deciles are those points dividing the 
total number of items into ten equal groups. The degree 
and character of the variation in a frequency distribution 
may be accurately described if the location of the quartiles 
and deciles is shown. Such knowledge, however, while 
helpful in giving a picture of the distribution, is not as use- 
ful for purposes of concise description and comparison as 
knowledge of the values of the mean deviation or the stand- 
ard deviation. The significance of a single measure is more 
readily grasped than is the meaning of a number of
interrelated values. Such a measure of variation may be
computed from the quartiles, however. With regard to ease of
calculation and immediate significance this quartile deviation
has distinct merits.

* A correction to be applied to the standard deviation in certain cases
(Sheppard's correction) is described in Chapter XIII.

Within the range between the two quartiles, of course, 
one half of all the measures are included. The greater the 
concentration the smaller this interval, hence a fairly accu- 
rate measure of dispersion may be obtained from the rela- 
tionship between these two quartiles. The quartile deviation 
is the semi-interquartile range, half the distance along the
scale between the first and third quartiles. Thus if Q.D.
represent the quartile deviation, $Q_1$ the first quartile and
$Q_3$ the third quartile,

$$Q.D. = \frac{Q_3 - Q_1}{2}.$$

If the value of a point on the scale half-way between 
the first and third quartiles is represented by K, one half 
of all the measures in a frequency distribution will fall 
within the range K ± Q.D. For the data in Table 32, 
relating to the hourly earnings in 1933 of steel workers in 
the Great Lakes and Middle West District, we have 

$Q_1 = 39.07$

$Q_3 = 59.03$

$Q.D. = \frac{59.03 - 39.07}{2} = 9.98$

$K = 39.07 + 9.98 = 49.05.$

Thus one half of all the measures lie within the range
49.05 ± 9.98. This statement, together with a statement
of the average hourly earnings in 1933 (mean, median, or
mode), constitutes a useful description of the distribution.
In a perfectly symmetrical distribution the value of K will
coincide with the value of the median (that is, the median
will lie half-way along the scale from $Q_1$ to $Q_3$). The
distribution of wage rates is slightly asymmetrical, the value of
the median being 48.11, as compared with the value of
49.05 for K.
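
The arithmetic is brief enough to set out in Python. The quartiles and median are the values quoted above for Table 32; the variable names are ours.

```python
q1, q3, median = 39.07, 59.03, 48.11    # values quoted for Table 32

quartile_deviation = (q3 - q1) / 2      # semi-interquartile range
k = (q1 + q3) / 2                       # point midway between the quartiles

# In a symmetrical distribution k and the median would coincide;
# the gap between them is a rough sign of asymmetry.
gap = k - median

print(round(quartile_deviation, 2), round(k, 2), round(gap, 2))
```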

THE PROBABLE ERROR 

In studying the results of astronomical and other physi- 
cal measurements it has been found that the values secured 
by different observers for the same constant quantity vary. 
These varying results, however, are distributed in a certain 
definite way, and when plotted give a curve similar to the 
normal curve of error. In such cases there is an immediate 
and obvious need of some measure of variation which may
be used as an index of the reliability of given results. If
the results secured by different investigators, or by the
same investigator at different times, vary widely they
cannot be accepted as reliable, while the reverse is true if the
variation is slight. The measure of dispersion which has
been generally employed in such cases is termed the
probable error. The probable error is that amount which, in a
given case, is exceeded by the errors of one half the
observations. Since the most probable value of a given series
of observations is their arithmetic mean, the probable error
is always measured from the mean. The name of this
measure derives from the fact that the probability that a
given observation will vary from the mean of all the
observations by an amount greater than the probable error
is exactly 1/2. It follows that, when the observations are
arranged in the form of a frequency distribution, a distance 
equal to the probable error laid off on each side of the arith- 
metic mean will define limits within which one half of the 
total number of cases will fall. 

This measure of variation has been employed in fields 
other than that in which it was originally applied, fields in 
which the name probable error is somewhat misleading. In 
such cases it is perhaps better to think of it as the probable 
deviation, that distance from the mean which will be
exceeded by one half of the total deviations.
The probable error is a measure of dispersion which is
fully significant only when it applies to a distribution fol- 
lowing the normal law of error. In such cases it has a 
definite and precise meaning. This is not so when it is 
applied to skew distributions, and its use in such cases 
is not advisable. The quartile deviation, the value of which 
is equal to that of the probable error in a normal distribu- 
tion, has a more direct significance than the probable error 
in the description of abnormal distributions, and should be 
employed in such cases. In a later section the use of the 
probable error as a measure of the reliability of statistical 
results is more fully explained. 

The value of the probable error in a given case, assuming 
a normal distribution to prevail, may be determined from 
the value of the standard deviation, for there is a constant 
relationship between these two. This is expressed by the 
formula: $P.E. = 0.6745\sigma$.
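
The constant 0.6745 is the point on the standard normal curve below which three quarters of the area lies, so that one half of the area falls within that distance on either side of the mean. A brief sketch using Python's `statistics.NormalDist` recovers it; the value of sigma used below is hypothetical and serves only to illustrate the formula.

```python
from statistics import NormalDist

# 75th-percentile point of the standard normal curve: half of all cases
# lie within this many standard deviations of the mean.
ratio = NormalDist().inv_cdf(0.75)

sigma = 10.0                        # hypothetical standard deviation
probable_error = 0.6745 * sigma     # formula given in the text

print(round(ratio, 4), round(probable_error, 3))
```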

Relations between Different Measures of 
Variation 

An understanding of the significance of the various meas- 
ures of dispersion described above may be facilitated by a 
general comparison and a summary statement of the rela- 
tions holding between them. 

1. The range is a distance along the scale within which all the
   observations lie.

2. The quartile deviation or semi-interquartile range is a distance
   along the scale which, when laid off on each side of the point
   midway between the two quartiles, includes one half the total
   number of observations.

3. The mean deviation from the mean, in a normal or slightly
   skew distribution, is equal to about 4/5 of the standard
   deviation. A range of 7½ times the mean deviation, centering at the
   mean, will include approximately 99 per cent of all the cases.

4. When a distance equal to the standard deviation is laid off on
   each side of the mean, in a normal or only slightly skew
   distribution, about two thirds of all the cases will be included.
   (In the normal distribution 68.27 per cent of the
   observations will be included.) When a distance equal to twice the
   standard deviation is laid off on each side of the mean
   approximately 95 per cent of the cases will be included (95.45
   per cent in a normal distribution). When a distance equal to
   three times the standard deviation is laid off on each side of
   the mean about 99 per cent of all the observations will be
   included (99.73 per cent in a normal distribution).
   This general rule that a range of six times the standard
   deviation, centering at the mean, will include about 99 per cent of
   all the measures furnishes a useful check upon calculations.

   A study of Fig. 45 may help to make clear the significance of
   the standard deviation in a normal distribution.

5. The probable error, in a normal distribution, is equal to $0.6745\sigma$.
   A range of twice the probable error, centering at the mean,
   will include 50 per cent of all the observations. A range of
   eight times the probable error, centering at the mean, will
   include approximately 99 per cent of all the observations.
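
For the normal distribution these relations may be verified numerically. The sketch below is our own illustration: it uses the standard result that the mean deviation of a normal distribution equals $\sigma\sqrt{2/\pi}$ (about 0.798$\sigma$, the "4/5" of item 3), and takes the probable error as 0.6745$\sigma$ from item 5.

```python
import math
from statistics import NormalDist

z = NormalDist()        # standard normal curve: mean 0, sigma 1


def share_within(k):
    """Proportion of a normal distribution within k sigma of the mean."""
    return z.cdf(k) - z.cdf(-k)


mean_deviation = math.sqrt(2 / math.pi)   # in units of sigma
probable_error = 0.6745                   # in units of sigma

for k in (1, 2, 3):
    print(k, round(100 * share_within(k), 2))

# A range of 7.5 mean deviations, or of 8 probable errors, centred at the
# mean reaches half that distance on either side.
share_md = share_within(7.5 * mean_deviation / 2)
share_pe = share_within(8 * probable_error / 2)
print(round(mean_deviation, 4), share_md > 0.99, share_pe > 0.99)
```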

Characteristic Features of the Chief Measures
of Variation

The range 

1. The range is easily calculated and its significance is readily
   understood. As a rough measure of the degree of variation
   the range is useful.

2. The value of the range is determined by the values of the two
   extreme cases. It is thus a highly unstable measure, the
   value of which may be greatly changed by the addition or
   withdrawal of a single figure.

3. This measure gives no indication of the character of the
   distribution within the two extreme observations.

The quartile deviation

1. The quartile deviation is a measure of dispersion that is easily
   computed and readily understood. It is superior to the range
   as a rough measure of variation.

2. The quartile deviation is not a measure of the variation from
   any specific average.

3. This measure is not affected by the distribution of the items
   between the first and third quartiles, or by the distribution
   outside the quartiles. The values of the quartile deviation
   might be the same for two quite dissimilar distributions,
   provided the quartiles happened to coincide. Because it is not
   affected by the deviations of individual items it cannot be
   accepted as an accurate measure of variation.

4. The quartile deviation is not suited to algebraic treatment.

The mean deviation

1. The mean deviation is affected by the value of every
   observation. As the average difference between the individual items
   and the median (or mean) of the distribution it has a precise
   significance.

2. The mean deviation is less affected by extreme deviations than
   the standard deviation.

3. Mathematically, the mean deviation is not as logical or as
   convenient a measure of dispersion as the standard deviation.

The standard deviation

1. The standard deviation is affected by the value of every
   observation.

2. The process of squaring the deviations before adding avoids the
   algebraic fallacy of disregarding signs.

3. The standard deviation has a definite mathematical meaning
   and is perfectly adapted to algebraic treatment.

4. The standard deviation is, in general, less affected by
   fluctuations of sampling than the other measures of dispersion.

5. The normal curve of error has been analyzed in terms of the
   standard deviation. The information thus obtained has
   increased greatly the utility of the standard deviation.

The probable error 

1. The probable error has a definite meaning in the case of a
   distribution following the normal law. It has not this precise
   meaning for other distributions, and should not be employed
   in describing them.

2. For distributions to which it is adapted, the probable error is an
   extremely useful measure. Its most important use is as an
   index of the magnitude of errors of sampling.

3. The definite relationship between the probable error and the
   standard deviation, for a normal distribution, permits the
   value of the probable error to be readily determined.

All the measures of variation described above may be 
utilized for particular purposes. The standard deviation, 
however, is the best general measure and should be em- 
ployed in all cases where a high degree of accuracy is
required. The probable error is, in effect, merely a fractional
part of the standard deviation, with a definite but restricted 
field of usefulness. 

The Measurement of Relative Variation

We have been dealing in the preceding section with 
absolute variability. The various measures of dispersion 
secured by the methods outlined describe the variability 
of the data in terms of absolute units of measurement. 
The standard deviation of London-Paris exchange rates is 
in francs, the standard deviation of pig iron production in 
tons, etc. If the object in a given case is the description of 
a single frequency distribution it is desirable that the orig- 
inal unit be employed throughout, but if measures of varia- 
tion of two different distributions are to be compared, diffi- 
culties are encountered. This is clear if the units are unlike, 
but even if the units are identical the same difficulty arises. 
Thus measures of variation in the weights of dogs and in the 
weights of horses might both have been computed in pounds. 
Because the standard deviation of horse weights is greater 
than the standard deviation of dog weights, it does not fol- 
low that the degree of variability is greater in the former 
case. A measure of absolute variation is significant only in 
relation to the average from which the deviations are meas- 
ured. Its use, apart from this average, is meaningless. For 
comparison, therefore, it must be reduced to a relative form, 
and the obvious procedure is to express a given measure
of variation as a percentage of the average from which the 
deviations have been measured. The quantity thus becomes 
an abstract number, a measure of the relative variability 
of the given observations, and may be compared with similar 
terms computed from other distributions. 

THE COEFFICIENT OF VARIATION

The measure of relative variation most commonly em- 
ployed is that developed by Pearson, termed the coefficient 
of variation, and represented by the letter V. It is simply
the standard deviation as a percentage of the arithmetic
mean. Thus

$$V = \frac{\sigma}{M} \times 100.$$

Applying this formula to the results secured from the
analysis of the distribution of steel workers, classified
according to hourly earnings in 1933 (Table 34), we have

$$V = \frac{18.685}{50.136} \times 100 = 37.27\%.$$

This measurement may be compared with a similar
coefficient relating to the distribution of workers in open-hearth
furnaces, classified according to average hourly earnings in
1935. In that year the mean wage was 71.946 cents and the
standard deviation 28.55 cents. From these

$$V = \frac{28.55}{71.946} \times 100 = 39.68\%.$$


Variation of hourly earnings among steel workers was
greater in 1935 than in 1933. The difference was not as 
great, however, as a comparison of standard deviations would 
indicate. The average wage advanced appreciably between 
1933 and 1935 and the relative variation increased only 
moderately. 
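
The comparison may be reproduced in a few lines of Python. The function name is ours; the figures are those quoted above for 1933 and 1935.

```python
def coefficient_of_variation(sigma, mean):
    """Standard deviation expressed as a percentage of the arithmetic mean."""
    return sigma / mean * 100


v_1933 = coefficient_of_variation(18.685, 50.136)
v_1935 = coefficient_of_variation(28.55, 71.946)

# Absolute variation rose by about one half, relative variation far less.
ratio_of_sigmas = 28.55 / 18.685
ratio_of_coefficients = v_1935 / v_1933

print(round(v_1933, 2), round(v_1935, 2))
print(round(ratio_of_sigmas, 2), round(ratio_of_coefficients, 2))
```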

An index of variability similar to this coefficient might 
be secured by expressing any of the other measures of 
deviation as a percentage of the average from which the 
deviations were computed. Pearson’s coefficient has been 
generally adopted, however, and is the only one in wide use. 

Measures of Skewness

Methods have been developed in the preceding sections 
for describing the central tendency of a frequency
distribution and for measuring the degree of concentration or
lack of concentration about that central tendency. One 
further measure is needed, and that is one which indicates 
the degree of skewness or asymmetry of a given distri- 
bution. For it is essential to know, in regard to a given 
distribution, whether the observations are arranged sym- 
metrically about the central value, or are dispersed in an 
uneven, asymmetrical fashion about that value. Having 
such a figure it will be possible effectively to summarize 
the characteristics of a frequency distribution in three sim- 
ple terms — an average, a measure of dispersion and a 
measure of skewness. There are two measures of skewness 
in current use. 

If a frequency curve is perfectly symmetrical, mean, 
median, and mode will coincide. As the distribution
departs from symmetry these three values are pulled apart,
the difference between the mean and the mode being great- 
est. This difference may be used, therefore, as a measure 
of skewness. It is desirable in this case, as in measuring 
relative variability, to secure an index in the form of an 
abstract number, which may be compared with similar fig- 
ures derived from other distributions. To this end, Pearson 
has proposed dividing the absolute difference between mean 
and mode by the standard deviation of the given distribu- 
tion. His formula is 

$$sk\ \text{(skewness)} = \frac{M - Mo}{\sigma}.$$

In a symmetrical distribution, where mean and mode coin- 
cide, the value of this measure will be zero. Under other 
conditions the value may be positive or negative, depending 
upon the relative positions of the two averages on the scale. 

For moderately skew distributions the degree of
skewness may be computed more readily from the formula

$$sk = \frac{3(M - Md)}{\sigma}.$$

This corresponds approximately to the other formula,
because of the fact that in a moderately asymmetrical
distribution the median lies between the mean and the mode,
about one third of the distance from the former towards
the latter.

Because it is difficult to locate the mode by simple meth- 
ods, a measure of skewness more easily computed than 
Pearson’s is desirable in some cases. Bowley has proposed 
such a method, based upon the relationship between the 
first and third quartiles and the median. If the distribution 
is symmetrical these two quartiles will be equidistant from 
the median; with an asymmetrical distribution this is not
so. Therefore, if we let $q_2$ represent the difference between
the upper quartile and the median and $q_1$ represent the
difference between the median and the lower quartile, we
may use the formula

$$sk = \frac{q_2 - q_1}{q_2 + q_1}$$

as a means of securing a measure of skewness. This value
will vary between 0 and ± 1. For with perfect symmetry
$q_2 = q_1$, and the measure is 0; with asymmetry so
pronounced that the median and one of the quartiles coincide,
either $q_2$ or $q_1$ becomes equal to 0, and the formula gives
a value of + 1 or - 1. Bowley suggests that a value of .1
indicates a moderate degree of skewness, while a value of
.3 indicates marked skewness.

The values secured from this measure are not, of course, 
comparable with the values secured from the application 
of Pearson’s formula for measuring skewness. 
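
The three formulas may be sketched as follows. The function names are ours. Bowley's measure is applied to the quartiles and median quoted earlier for Table 32; the figures passed to the two Pearson formulas are hypothetical, chosen so that the median lies one third of the way from the mean towards the mode, in which case the two formulas agree.

```python
def pearson_skewness(mean, mode, sigma):
    """Pearson's measure: (M - Mo) / sigma."""
    return (mean - mode) / sigma


def pearson_skewness_from_median(mean, median, sigma):
    """Approximation for moderately skew distributions: 3(M - Md) / sigma."""
    return 3 * (mean - median) / sigma


def bowley_skewness(q1, median, q3):
    """Bowley's measure: (q2 - q1) / (q2 + q1), in the notation of the text."""
    upper = q3 - median      # q2 in the text
    lower = median - q1      # q1 in the text
    return (upper - lower) / (upper + lower)


# Quartiles and median quoted for Table 32.
bowley = bowley_skewness(39.07, 48.11, 59.03)

# Hypothetical figures for the two Pearson formulas.
sk_mode = pearson_skewness(mean=50.0, mode=44.0, sigma=20.0)
sk_median = pearson_skewness_from_median(mean=50.0, median=48.0, sigma=20.0)

print(round(bowley, 3), round(sk_mode, 3), round(sk_median, 3))
```

On Bowley's scale the wage distribution of Table 32 falls just under the value of .1 that he takes to indicate moderate skewness.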

KURTOSIS

Reference has been made to a fourth measurable char- 
acteristic of frequency curves. This is the degree of flat- 
toppedness, as compared with the normal curve. A measure 
of kurtosis, the technical term for this characteristic, is 
given in Chapter XIII. 
CHAPTER VI


INDEX NUMBERS OF PRICES

The Nature of Index Numbers

The term “index number” has been applied to a number 
of somewhat similar devices employed in the analysis of 
statistical series. Index numbers have been most widely 
used in the study of price changes, but a brief considera- 
tion of certain other uses may make clear the essential 
characteristics of such measures. In its simplest form this 
name is applied to a term in a time series expressed as a 
relative number. Thus an index number of cotton consumption
in the United States might take the following form:

Table 35

Domestic Cotton Consumption in the United States,
1926-1936

(Consumption in year ended July 31, 1926 = 100)

Year ended     Cotton consumption        Cotton consumption
July 31        (unit: one thousand       relative
               running bales)

1926           6,456                     100
1927           7,190                     111
1928           6,834                     106
1929           7,091                     110
1930           6,106                      95
1931           5,263                      82
1932           4,866                      75
1933           6,137                      95
1934           5,700                      88
1935           5,361                      83
1936           6,351                      98


Similarly the price of a commodity may be expressed as 
a relative, the price at a given date or for a given period 
serving as base.

Table 36

Average Price of No. 1 Northern Spring Wheat, Minneapolis
1913, 1929-1936

(Average price in year ended June 30, 1913 = 100)

Calendar       Weighted average       Relative
year           price per bushel       price

1913           $0.874                 100
1929            1.276                 146
1930            0.984                 113
1931            0.739                  85
1932            0.605                  69
1933            0.770                  88
1934            1.026                 117
1935            1.165                 133
1936            1.247                 143


The representation of the terms in a time series as rela- 
tives, with reference to a fixed base, makes possible a ready 
comparison of the values for different dates and enables one 
to follow the trend of the series much more easily than 
when the data are presented in their original form. Compari- 
son of the trends of different series is also facilitated. 
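
The reduction of a series to relatives on a fixed base can be sketched as follows. The figures are the cotton consumption data of Table 35; the function and variable names are illustrative assumptions, not part of the original text.

```python
# Cotton consumption, thousands of running bales, year ended July 31 (Table 35).
COTTON = {
    1926: 6456, 1927: 7190, 1928: 6834, 1929: 7091, 1930: 6106, 1931: 5263,
    1932: 4866, 1933: 6137, 1934: 5700, 1935: 5361, 1936: 6351,
}


def to_relatives(series, base_year):
    """Express every term as a percentage of the base-year term."""
    base = series[base_year]
    return {year: round(100 * value / base) for year, value in series.items()}


cotton_relatives = to_relatives(COTTON, base_year=1926)
```

Any other year could serve as base by changing `base_year`; the relatives for the remaining years shift proportionately.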

Though the term index number has been applied to such 
relatives it is better practice to reserve the term for figures 
which represent the combination of a number of series. 
The series to be combined may relate to prices, production, 
consumption, wages, volume of trade, or to any factor sub- 
ject to temporal variation. (Index numbers have been used 
also in measuring such geographical differences as arise 
from variations in living costs from city to city or from 
country to country.) Quite complex problems may be in- 
volved in the construction of any one of these special forms 
of index numbers, but the essential aim in all cases is to
secure a single, simple series that will define the net resultants 
of the changes occurring in the constituent elements. 

A simple index number may be constructed to represent 
the course of coal and petroleum production in the United 
States. In the making of such an index it is necessary to
combine in some way production figures for bituminous
and anthracite coal and petroleum. The production figures 
and the corresponding relatives for the three series, from 
1922 to 1936, are given in Table 37. 


Table 37

Production of Bituminous and Anthracite Coal and Petroleum
in the United States, 1922-1936
(Production in 1922 = 100)

        Prod. of               Prod. of               Prod. of
        bit. coal              anthr. coal            petrol.
Year    (million      Rel.     (million      Rel.     (million     Rel.
        sh. tons)              sh. tons)              bbls.)

1922    422.3         100      54.7          100        557.5      100
1923    564.6         134      93.3          171        732.4      131
1924    483.7         115      87.9          161        713.9      128
1925    520.1         123      61.8          113        763.7      137
1926    573.4         136      84.4          154        770.9      138
1927    517.8         123      80.1          146        901.1      162
1928    500.7         119      75.3          138        901.5      162
1929    535.0         127      73.8          135      1,007.3      181
1930    467.5         111      69.4          127        898.0      161
1931    382.1          90      59.6          109        851.1      153
1932    309.7          73      49.9           91        785.2      141
1933    333.6          79      49.5           90        905.7      162
1934    359.4          85      57.2          105        908.1      163
1935    372.4          88      52.2           95        996.6      179
1936    434.1         103      54.8          100      1,098.5      197


A rough index of fuel production, based upon these three 
series, is desired. It is impossible, obviously, to add the 
original figures, as the units are not the same. This diffi- 
culty may be avoided by using the relative figures. A simple 
average of the three relatives for a given year may serve 
as the required index. Index numbers thus secured are 
given in Table 38 on page 164. 
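
The simple average of relatives may be sketched as below, using the relatives of Table 37 for the first four years. The names are illustrative assumptions.

```python
from statistics import mean

# Relatives (bituminous coal, anthracite coal, petroleum) from Table 37.
RELATIVES = {
    1922: (100, 100, 100),
    1923: (134, 171, 131),
    1924: (115, 161, 128),
    1925: (123, 113, 137),
}

# Simple (equally weighted) average of the three relatives for each year.
simple_index = {year: mean(values) for year, values in RELATIVES.items()}
```

Rounded to whole numbers these averages are 100, 145, 135 and 124, the first four entries of Table 38.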

Table 38

Index Numbers of Coal and Petroleum Production in the
United States, 1922-1936
(Production in 1922 = 100)

Year      Index        Year      Index

1922      100          1930      133
1923      145          1931      117
1924      135          1932      102
1925      124          1933      110
1926      143          1934      118
1927      144          1935      121
1928      140          1936      133
1929      148

In securing this index, by adding the three relative figures
for a given year and dividing by three, equal weight
has been given to each of the three series. Such an index of
equally weighted relatives has been termed an unweighted
index, but the term is misleading. Weights are used, the
weights in this case being equal. It is clear that this index
based upon equal weights does not reflect faithfully the
three series combined in the present instance. For the three
series are not of equal importance, as the system of equal
weights assumes. The following figures showing the wholesale
values in exchange in 1926 of bituminous coal, anthracite
coal, and crude petroleum indicate the relative importance
of the three series: *


Mineral              Wholesale value in
                     exchange in 1926

Bituminous coal      $2,157,740,000
Anthracite coal         888,141,000
Petroleum             1,355,989,000


Roughly, these stand to one another in the relation of 
5, 2, and 3, and these weights may be assigned to the series 
under consideration. An index for each year may be com- 
puted, using these weights. The example in Table 39, 
showing the calculations for the years 1922 and 1923, will 
illustrate the method. 


* The figures have been compiled by the U. S. Bureau of Labor Statistics.

Table 39

Computation of Weighted Index Numbers of Coal
and Petroleum Production

                     1922                            1923
Mineral              Relative            Wt. ×       Relative            Wt. ×
                     production   Wt.    Rel.        production   Wt.    Rel.

Bituminous coal      100           5       500       134           5       670
Anthracite coal      100           2       200       171           2       342
Petroleum            100           3       300       131           3       393

                                  10     1,000                    10     1,405

Index of fuel production, 1922 = 1,000 ÷ 10 = 100
Index of fuel production, 1923 = 1,405 ÷ 10 = 141


The value of the index thus secured for each of the fif- 
teen years covered is shown in Table 40. 

Table 40

Weighted Index Numbers of Coal and Petroleum
Production in the United States, 1922-1936

Year      Index        Year      Index

1922      100          1930      129
1923      141          1931      113
1924      128          1932       97
1925      125          1933      106
1926      140          1934      112
1927      139          1935      117
1928      136          1936      131
1929      145

Differences between the two series of index numbers are
to be expected. The second series, which is the more logically
weighted, is, of course, the more accurate of the two,
and gives a more faithful representation of the combined 
effect of the forces affecting the output of coal and petroleum. 
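
The weighted computation of Table 39 may be sketched as follows. The weights 5, 2 and 3 and the relatives come from the text; the function name is an illustrative assumption, and the result is left unrounded so that the rounding convention stays visible.

```python
# Weights for bituminous coal, anthracite coal, petroleum (from the text).
WEIGHTS = (5, 2, 3)


def weighted_index(relatives, weights=WEIGHTS):
    """Weighted arithmetic average of relatives: sum(w * r) / sum(w)."""
    total = sum(w * r for w, r in zip(weights, relatives))
    return total / sum(weights)


index_1922 = weighted_index((100, 100, 100))
index_1923 = weighted_index((134, 171, 131))   # 1,405 / 10
index_1924 = weighted_index((115, 161, 128))   # 1,281 / 10
```

The 1923 value comes out at exactly 140.5, which Table 40 reports as 141. Python's built-in `round` would give 140 here, since it rounds halves to the nearest even integer, so a program meant to reproduce the printed table would have to round halves upward explicitly.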

Another type of index number is one in which the items 
in the constituent series are totaled, the aggregate figure,
instead of an average, serving as the representative of the 
entire group. Such a form of index number may be constructed
only when the different series are all expressed in
the same unit. This form is frequently employed as an 
indication of changes in the level of prices, the aggregate 
cost of a bill of goods at one period being compared with 
the aggregate cost of the same goods at other dates. The 
figures in Table 41 illustrate this type of index. 

Table 41

Bradstreet’s Index of Wholesale Prices in the
United States, 1926-1937 *

Year      Index        Year      Index

1926      13.02        1932       7.10
1927      12.78        1933       7.86
1928      13.28        1934       9.22
1929      12.67        1935       9.92
1930      10.75        1936      10.10
1931       8.76        1937      11.06


Each of the yearly aggregates quoted above is the sum 
of the average prices during the year of 96 commodities at 
wholesale. Before being added all the prices are reduced to 
the “per pound” basis, so that a certain degree of compara- 
bility is secured. Such an index may be readily changed to 
the relative form, any year being taken as a base and the 
totals for the other years expressed as percentages of the 
figure for the base year. 

* Construction of this index was discontinued at the end of 1937.

The examples which have been given will indicate some
of the many forms which index numbers may take. The
term may refer to a simple relative number; it may be
applied to an average of relative terms, or to an aggregate
of relative or absolute figures. In all the examples given
the index has been designed to serve as a measure of change
over a period, as an indicator of changes in the values of
time series. The term may have a much broader meaning
than this. An index of the ability of salesmen might be
constructed by giving numerical values to the factors determining
their usefulness and securing an average of these
values. An index of the efficiency of different departments 
in a business enterprise might be constructed. In any case, 
the construction of an index involves the reduction to com- 
parable terms of a number of different factors and the 
replacement of these several terms by a single figure which 
may serve as their representative. Comparison is thus 
facilitated, whether it be comparison over time or space, or 
comparison with other indices secured by averaging terms 
relating to a similar unit. In all its forms (except the first 
limited and exceptional meaning in which it applies to a 
simple relative) an index number is thus a type of statistical 
average, and such numbers, in their construction and use, 
are subject to all the rules and limitations set forth in the 
development of the subject of averages. 

In the present work we are interested only in the applica- 
tion of the index number device to time series. So varied, 
however, are the rules and practices relating to its applica- 
tion to different types of time series that certain of these 
types must be treated separately. Our first concern is with 
index numbers of wholesale prices. 

Price Changes 

When price movements are surveyed in detail it is difficult
to perceive order, or any definite trend. We find a mul-
tiplicity of conflicting movements. The price quotations 
in Table 42 (on page 168), taken at random, are roughly 
typical of what would be found were the entire field of prices 
canvassed in order to compare price movements from month 
to month. 

Of the sixteen commodities listed, five showed no price 
change at all between October and November, 1937, two 
showed price increases, and in nine cases prices declined. 
Some of the price movements were inconsiderable, while 
some marked very material changes. Such, as seen here 
in miniature, is what happens in the price system as a whole.

Table 42

Commodity Prices at Wholesale *

                                                        Price          Price
Commodity                               Unit            (wholesale)    (wholesale)
                                                        October,       November,
                                                        1937           1937

Brick, common building, average
  of yard prices                        1,000           $12.113        $12.113
Pig iron, basic, Valley furnace         Gross ton        23.500         23.500
Cement, Portland, average of
  plant prices                          Bbl.              1.667          1.667
Linseed oil, raw, N. Y.                 Pound              .110           .106
Steel billets, rerolling, Pitts.        Gross ton        37.000         37.000
Steel, scrap, Chicago                   Gross ton        14.688         12.500
Copper, electrol., refinery             Pound              .119           .108
Lead, pig, N. Y.                        Pound              .058           .051
Zinc, pig, N. Y.                        Pound              .065           .060
Coal, anthr., chestnut, average
  of 15 price series, on tracks,
  destination                           Net ton           9.472          9.610
Coal, bit., mine run, average of
  27 price series, on tracks,
  destination                           Net ton           4.305          4.303
Crude petroleum, Penn., at wells        Bbl.              2.413          2.350
Gasoline, motor, California,
  refinery                              Gal.               .083           .085
Cotton, middling, N. O.                 Pound              .083           .080
Wheat, no. 2 red winter, Chi.           Bu.               1.033           .951
Sugar, granulated, N. Y.                Pound              .048           .048
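
The tally of price movements reported for Table 42 can be checked with a short sketch. The prices are those printed in the table; the shortened commodity labels and the function name are illustrative assumptions.

```python
# (October 1937, November 1937) wholesale prices from Table 42.
PRICES = {
    "Brick, common building": (12.113, 12.113),
    "Pig iron, basic": (23.500, 23.500),
    "Cement, Portland": (1.667, 1.667),
    "Linseed oil, raw": (0.110, 0.106),
    "Steel billets, rerolling": (37.000, 37.000),
    "Steel, scrap": (14.688, 12.500),
    "Copper, electrolytic": (0.119, 0.108),
    "Lead, pig": (0.058, 0.051),
    "Zinc, pig": (0.065, 0.060),
    "Coal, anthracite, chestnut": (9.472, 9.610),
    "Coal, bituminous, mine run": (4.305, 4.303),
    "Crude petroleum, Penn.": (2.413, 2.350),
    "Gasoline, motor": (0.083, 0.085),
    "Cotton, middling": (0.083, 0.080),
    "Wheat, no. 2 red winter": (1.033, 0.951),
    "Sugar, granulated": (0.048, 0.048),
}


def tally_changes(prices):
    """Count commodities whose price was unchanged, rose, or fell."""
    tally = {"unchanged": 0, "increased": 0, "declined": 0}
    for october, november in prices.values():
        if november > october:
            tally["increased"] += 1
        elif november < october:
            tally["declined"] += 1
        else:
            tally["unchanged"] += 1
    return tally


tally = tally_changes(PRICES)
```

The counts agree with the text: five commodities unchanged, two increases, and nine declines.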


* As compiled by the U. S. Bureau of Labor Statistics.

All prices do not, with absolute uniformity, move up or
down, or remain constant. Each of the thousands of commodities
traded in on the markets of any country, or of the
world, moves in its own individual way, subject to a variety
of influences. Yet it does not act in isolation. In its price
movements it affects other commodities, and is affected
by them. And, in addition to the forces peculiar to each
commodity, there are broad forces which act throughout
the price system, affecting all commodities. It is the business
of the economic statistician to bring order out of the
chaos of price movements taking place at any given time
and, out of the multiplicity of minor movements, to pick
the broad trends which affect the whole economic system.

The forces bringing about the price movements that are 
to be studied are numerous and complicated, but some 
general conclusions may be drawn with regard to them. 
There are, in the first place, all those changes in production 
and consumption conditions peculiar to individual commodities
and affecting directly the prices of those commodities.
The opening of new fields, improvements in production 
technique in individual cases, changes in fashion and the 
transfer of demand from some commodities to others, changes 
in demand and supply with the seasons — all these are 
causing constant price readjustments. These are the changes 
which in ordinary times are most obvious, which are brought 
home directly to the individual merchant or consumer. 
Such changes affect the whole price system, as has been 
pointed out, but not in general by causing upward or down- 
ward movements in the system as a whole. 

These general movements are due to forces that are 
broader in their scope. The general improvement in pro- 
duction technique and the increase in the productivity of 
human labor which has resulted have, by increasing the 
supply of commodities available for consumption, affected 
prices. Changes in monetary systems and, in particular, 
changes in the gold supply have exerted a direct and imme- 
diate influence upon prices, by affecting the supply of money 
in circulation. Similar in character have been changes in 
banking and credit systems and changes in commercial 
practice that have affected the use of credit instruments 
and the rapidity of circulation of money and credits. All 
these forces influence prices, though their incidence is not 
so specific as are those of the factors affecting individual
commodities directly.

Purpose of General Index Numbers of Wholesale Prices

These separate forces cannot be isolated and evaluated. 
Their joint action causes a perplexing variety of price 
changes. In studying these changes the problem might be 
approached from several different points of view. It might 
be desired to study the readjustments that take place within 
the price system, to determine the nature and degree of the 
shifts within the system that come with changing conditions. 
Such a study would yield valuable information as to the 
behavior of prices and the character of their interrelations. 
Our immediate problem, however, is the determination of 
the net resultant of all these forces. Do all price movements 
cancel each other so that while some prices move up and 
some down there is no net change? Or is there at a given 
time a preponderance of movements in one direction, causing
the level of general prices to move upward or downward? 
If there is such a trend, what is it, and how may it be meas- 
ured? Are the statistical methods that have been explained 
in the earlier sections applicable to the solution of this 
problem? 

The first step in this study involves the answering of the 
last question asked. It has been brought out that methods 
of summarizing quantitative data have been developed, 
but that these methods are applicable only when certain 
conditions are fulfilled. An average, it was noted, has no 
significance unless it represents a distinct central tendency 
in a mass of homogeneous data. Moreover, the type of 
average to be employed depends upon the character of the 
distribution it is to represent. Until the distribution of 
the original data is studied no average or other statistical 
measure can be intelligently employed. We must first, 
then, determine what the raw materials of the problem are, 
and study the frequency distributions secured when these 
raw materials are organized.

For the present a quite general purpose will be assumed,
the determination of the change in the level of general
wholesale prices between two specific dates. This is equiva- 
lent, of course, to measuring the change in the purchasing 
power of money in wholesale markets. The raw materials 
of the problem consist of a number of price quotations on 
individual commodities, quotations being secured for the 
two dates to be compared. Each pair of quotations meas- 
ures the change in the price of a single commodity, a change 
caused by the interplay of many forces. When a great 
many such price quotations are brought together we have 
a mass of data representing the interaction of a multitude 
of forces, some individual and specific in their incidence, 
some general, affecting the prices of large groups of com- 
modities or of all commodities. What we seek to determine 
is the net resultant of all these factors. We seek a measure
of the composite effect of the numerous forces that are 
causing individual prices to rise or fall. This measure will 
constitute an index number of wholesale prices.

The unit with which we must deal is a single price varia- 
tion. Whether the statistical methods with which we are 
familiar may be employed in the organization and analysis 
of a number of such units depends upon the behavior of 
such units in mass. The following examples illustrate the 
frequency distributions secured when these data are clas- 
sified. 

Frequency Distributions of Price Ratios 

Each price variation is, of course, a ratio, the ratio of the 
price of a commodity at a given date to the price of the 
commodity at another date. The ratios may be reduced 
to a comparable basis by putting them all in the form of 
relatives, of the type illustrated in the earlier examples of 
index numbers. Thus, using one of the pairs of price quo- 
tations given above, the ratio of the price of steel scrap 
in November, 1937, to the price in October, 1937, is
$12.500 : $14.688, which, in the form of a relative, becomes
85.1 : 100. In constructing the following frequency table, the
prices at wholesale in 1927 of 670 commodities were ex- 
pressed as relatives, with the 1926 price as a base in each 
case. The distribution of these 670 relative numbers is 
shown in Table 43. 

Table 43

Distribution of the Relative Prices of 670 Commodities in 1927 *

(Average prices in 1926 = 100)

Relative prices      Mid-point      No. of cases      Percentage of total
                         m               f            number of cases

 52.5- 57.4              55               1                  .1
 57.5- 62.4              60               2                  .3
 62.5- 67.4              65               6                  .9
 67.5- 72.4              70               7                 1.0
 72.5- 77.4              75               8                 1.2
 77.5- 82.4              80              25                 3.7
 82.5- 87.4              85              50                 7.5
 87.5- 92.4              90              76                11.3
 92.5- 97.4              95             136                20.3
 97.5-102.4             100             196                29.3
102.5-107.4             105              83                12.4
107.5-112.4             110              26                 3.9
112.5-117.4             115              16                 2.4
117.5-122.4             120              14                 2.1
122.5-127.4             125              12                 1.8
127.5-132.4             130               2                  .3
132.5-137.4             135               3                  .5
137.5-142.4             140               5                  .8
142.5-147.4             145               1                  .1
147.5-152.4             150               -                   -
152.5-157.4             155               1                  .1

                                        670               100.0


The frequency polygon representing this distribution ap- 
pears in Fig. 49. For purposes of comparison with similar 
distributions the figure shows the percentage distribution. 

* The 670 commodities included were those employed by the U. S. Bureau
of Labor Statistics in the construction of its index of wholesale prices. The
original figures, and the relatives, appear in Bulletin 473, of that Bureau, on
“Wholesale Prices, 1913-1927.”

The correspondence of this frequency distribution to the
standard types portrayed in earlier sections is obvious.
There is the same marked concentration about a central
tendency, in this case a tendency of prices to remain stable,
for 29 per cent of all the cases showed a change not exceeding
2.5 per cent from their prices in the base year. There
is also, in this case, a fairly symmetrical distribution about
this central tendency, though the range above the mode is
slightly greater than the range below. Without at present
considering the question as to which average might best be
used to represent the central tendency in this distribution,
it is apparent that the use of some average is quite legitimate.

Fig. 49. Frequency Polygon: Distribution of Relative Prices of
670 Commodities in 1927 (Average prices in 1926 = 100).
Horizontal axis: relative price.
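
The passage from price quotations to a percentage distribution can be sketched as follows. The steel scrap prices and the class frequencies are taken from the text and from Table 43; the names are illustrative assumptions. The printed percentages appear to have been adjusted so as to total 100.0, so a direct computation need not match every entry of the table.

```python
# A single price variation expressed as a relative (steel scrap, Table 42).
steel_scrap_relative = round(100 * 12.500 / 14.688, 1)

# Class mid-points and frequencies from Table 43 (1927 prices, 1926 = 100).
MIDPOINTS = list(range(55, 160, 5))   # 55, 60, ..., 155
COUNTS = [1, 2, 6, 7, 8, 25, 50, 76, 136, 196, 83,
          26, 16, 14, 12, 2, 3, 5, 1, 0, 1]

total_cases = sum(COUNTS)

# Percentage of all cases falling in each class.
percentages = [100 * count / total_cases for count in COUNTS]

# The modal class is the one with the greatest frequency.
modal_midpoint = MIDPOINTS[COUNTS.index(max(COUNTS))]
modal_share = round(max(percentages), 1)
```

The modal class is the one centered on 100, and it holds a little over 29 per cent of the 670 cases, which is the concentration noted in the text.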

The example just given has been based upon price varia- 
tions from one year to the next, over a period during which 
the level of general prices declined slightly (4.6 per cent). 
W. C. Mitchell gives a much more comprehensive illustra- 
tion, based upon the distribution of 5,578 price variations 
from one year to the next over the period 1890-1913, which 
shows the same general grouping. The excess of the range 
above the mode over the range below is somewhat more 
pronounced, in connection with which fact it should be 
noted that prices were rising during most of the 23 years
covered. The distribution secured by Mitchell is shown in
Fig. 42.

The inertia of prices is most conspicuous when year-to-year
price changes are studied. It is therefore advisable to
consider the character of price variations over a longer
period, that we may learn whether the same type of distribution
is secured. Two examples are given, one of price
changes over a seven-year period, marked by a considerable
decline in prices, the other of price changes over a five-year
period characterized by rapidly rising prices. The table
following shows the distribution of 774 price variations,
prices in 1933 being expressed as relatives on a 1926 base.
The general level of wholesale prices, it should be noted,
declined some 33 per cent from 1926 to 1933.

Fig. 50. Frequency Polygon: Distribution of Relative Prices of
774 Commodities in 1933 (Average prices in 1926 = 100).
Horizontal axis: relative price.

The data in Table 44 are plotted in the form of a frequency 
polygon in Fig. 50, the percentage distribution being shown. 
It will be noted that the distribution is curtailed, the five
upper classes being omitted. 

Table 44

Distribution of Relative Prices of 774 Commodities in 1933

(Average prices in 1926 = 100)

Relative prices      Mid-point      No. of cases      Percentage of total
                         m               f            number of cases

 10 -  14.9             12.5              3                  .4
 15 -  19.9             17.5              -                   -
 20 -  24.9             22.5              1                  .1
 25 -  29.9             27.5              7                  .9
 30 -  34.9             32.5             13                 1.7
 35 -  39.9             37.5             24                 3.1
 40 -  44.9             42.5             28                 3.6
 45 -  49.9             47.5             51                 6.6
 50 -  54.9             52.5             49                 6.3
 55 -  59.9             57.5             50                 6.5
 60 -  64.9             62.5             62                 8.0
 65 -  69.9             67.5             58                 7.5
 70 -  74.9             72.5             93                12.0
 75 -  79.9             77.5             81                10.5
 80 -  84.9             82.5             62                 8.0
 85 -  89.9             87.5             67                 8.7
 90 -  94.9             92.5             40                 5.2
 95 -  99.9             97.5             27                 3.5
100 - 104.9            102.5             27                 3.5
105 - 109.9            107.5             11                 1.4
110 - 114.9            112.5              6                  .8
115 - 119.9            117.5              8                 1.0
120 - 124.9            122.5              1                  .1
125 - 129.9            127.5              2                  .3
155 - 159.9            157.5              1                  .1
180 - 184.9            182.5              1                  .1
190 - 194.9            192.5              1                  .1

                                        774               100.0

The distributions depicted in Figs. 49 and 50 differ materially.
The range of the variations is greater in the second
case, a condition naturally to be expected because of the
longer period covered. Secondly, a very much smaller percentage
of cases is concentrated in the modal group, though
there is still a pronounced central tendency. Both distributions,
as plotted on the arithmetic scale, are fairly symmetrical,
though a few extreme cases extend the actual upper limit
of the second distribution. In Fig. 49 the concentration about 
the central tendency is much more marked, and the devia- 
tions of individual price ratios from the central tendency
are smaller. This distribution resembles one which would
be secured from highly accurate physical measurements, or 
the distribution of shots from a very accurate piece of artil- 
lery. The second curve corresponds to one representing less 
accurate physical measurements, or to the distribution of 
shots from an old or inaccurate field piece. The modal 
value occurs less frequently and the deviations from the 
central tendency are greater. It has been established that
the longer the period covered in price comparisons such as 
those made above, the more pronounced is the tendency 
shown in the second curve. The value of the maximum 
ordinate falls and the range of the distribution increases. 
The curve becomes flatter and more extended as the time 
interval increases. And, quite obviously, as this process 
goes on the representative character of any type of average 
declines. Unless there is concentration about a central 
tendency an average is merely an abstraction, without con- 
crete significance. 

It is possible at this point to state as a tentative conclu- 
sion that price variations are capable of statistical measure- 
ment, that they may be represented appropriately by an 
average value, provided the period covered is not too long. 
No definite statement can be made as to the maximum 
period over which price variations may be measured. Index 
numbers having accurate and significant values must be 
based upon comparisons over relatively short periods, the 
most accurate being year-to-year comparisons. Index num- 
bers designed merely to show general trends in prices may 
cover longer periods, though the makers and users of such 
index numbers should realize their limitations.¹

As a final example we may note the distribution of the
relative prices of 1,437 commodities in 1918, average prices
during the period July, 1913 to June, 1914 serving as base.²
This was a period marked by rapidly rising prices. In consulting
the graph (Fig. 51) it should be noted that the scales
are not the same as those employed in the two figures preceding.

¹ Cf. W. C. Mitchell, “The Making and Using of Index Numbers,” Bulletin
284 (Wholesale Price Series), U. S. Bureau of Labor Statistics.

² Data compiled by the Price Section of the War Industries Board; reproduced
in Part I, Bulletin 284, U. S. Bureau of Labor Statistics, 70.

A study of this distribution bears out the conclusion
reached from the two examples preceding. There is a central
tendency sufficiently pronounced to be well represented by
an average. In this case, moreover, the modal group is
that with a mid-point of 180, so that the tendency toward
concentration cannot be attributed to inertia, but to the
presence of external forces affecting the price system as a
whole. There is, however, one marked point of difference
between this distribution and the two others. The tendency
toward skewness, which was in evidence in the first example,
is pronounced in this case. The curve, as plotted on the
arithmetic scale, is markedly asymmetrical. The greatest
concentration is near the lower limit of the scale and a long
tail, extending in fact far beyond the limit of the chart,
tapers out to the right. The highest relative price, indeed,
is 3,009, representing an increase of 2,909 points. The smallest
relative price, in comparison, is 36, representing a decline
of 64 points on the scale.

Fig. 51. Frequency Polygon: Distribution of Relative Prices of 1,437
Commodities in 1918 (Average prices July 1913 to June 1914 = 100)

A price increase, expressed as a relative, has no upper 
limit. An increase of 100, 500, 1,000 per cent or more is 
conceivable and possible. The greatest price increase noted 
by the War Industries Board in its study of prices during 
the war was one of 4,981 per cent, in the case of acetphenetidin.
But 100 per cent is the maximum decline possible, as
that would mean that the price of a commodity had fallen 
to zero. This is the explanation of the skewness noted in 
the curves shown. When any considerable number of price 
ratios are tabulated the corresponding frequency curve, 
plotted on an arithmetic scale, shows this characteristic 
feature, a feature which is most conspicuous during a period 
of rising prices. 
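
The asymmetry of price relatives on the arithmetic scale can be illustrated with a short sketch. The comparison with logarithms is an added illustration, not part of the argument above, and the function names are hypothetical; it is offered only because the text specifies that the curves are plotted on an arithmetic scale.

```python
import math


def point_change(relative):
    """Change in points on the arithmetic scale, base = 100."""
    return relative - 100


def log_ratio(relative):
    """Natural logarithm of the price ratio (relative / 100)."""
    return math.log(relative / 100)


# Extremes reported for the 1918 distribution.
highest_points = point_change(3009)
lowest_points = point_change(36)

# A doubling and a halving of price are unequal in points
# but equal and opposite in logarithms.
doubling_points, halving_points = point_change(200), point_change(50)
doubling_log, halving_log = log_ratio(200), log_ratio(50)
```

A doubling adds 100 points while a halving removes only 50, although the two movements are reciprocal ratios. No decline can exceed 100 points, whereas increases are unbounded, which is the source of the long right-hand tail.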

The argument developed in the preceding pages may be 
briefly summarized. Before discussing the practice of index 
number construction it was considered advisable to study 
the character of the raw materials and the nature of the 
distributions secured when these materials are brought together,
in order to determine whether ordinary statistical
methods are appropriate. The raw materials, we have seen, 
consist of individual price variations, expressed as ratios. 
When a number of these ratios are assembled a frequency 
distribution is secured which somewhat resembles the dis- 
tribution of data following the normal law of error. A 
central tendency, which may legitimately be represented 
by an average, is apparent in the distribution of price variations.
The central tendency is less marked, however, and
the deviations from it are more pronounced, the longer the 
period covered in the price comparison, so that an average 
becomes less representative as this period increases. In 
addition, a tendency toward skewness has been noted, and 
this was seen to be quite pronounced in a period of rising
prices. This skewness is due to the fact that we are dealing
with ratios that have a definite lower limit and no upper 
limit. 

Variety of Methods Employed in Index Number 
Construction 

Many methods have been and are being employed in the 
construction of index numbers of wholesale prices. Usage 
varies for many reasons. There are differences of opinion 
as to which is theoretically the best method. There are 
practical difficulties to be surmounted, difficulties which 
inevitably cause differences in practice because of the vary- 
ing resources of the agencies engaged in these tasks. And 
there are, finally, differences due to the varying purposes 
for which index numbers are constructed, the varying ques- 
tions they are designed to answer. 

Prevailing differences in practice and differences in the
results secured by the employment of various methods in 
the construction of index numbers can perhaps be illus- 
trated most effectively by the application of a number of 
methods to the same data. Table 45, on the preceding page, 
presents the raw material to which these various methods are 
to be applied: the average farm prices, on December 1, of
twelve leading crops, from 1919 to 1935. 

EXPLANATION OF SYMBOLS

The symbols to be employed in the computation of different
types of index numbers have the following meanings:

$p_0'$ : price of a given commodity at time “0” (the base period).
$q_0'$ : quantity of same commodity at time “0”.
$p_1'$ : price of same commodity at time “1”.
$q_1'$ : quantity of same commodity at time “1”.
$p_0''$ : price of a second commodity at time “0”.
$q_0''$ : quantity of second commodity at time “0”.
$p_1''$ : price of second commodity at time “1”.
$q_1''$ : quantity of second commodity at time “1”.
$\frac{p_1}{p_0}$ : a price relative (relation of price of a given commodity at time “1” to price of same commodity at time “0”).
$\frac{q_1}{q_0}$ : a quantity relative.
$P_0$ : price level at time “0”.
$P_1$ : price level at time “1”.

Simple Index Numbers of Prices 
In his exhaustive analysis of methods of index number
construction,* Irving Fisher distinguishes six fundamental
types: the aggregative (or price aggregate), the arithmetic,
harmonic, geometric, median, and mode. The last has
never been employed in a practical way, and may be omitted.
The characteristics of the five remaining types may be
brought out by considering each of them in its simplest form,
before examining the more complicated combinations.

AGGREGATES OF ACTUAL PRICES

In the construction of index numbers of the simple aggregative
type, commodity prices pertaining to a given
date are added; general price changes are measured by
comparing the results thus secured for different dates. Using
the above symbols

$$\frac{P_1}{P_0} = \frac{\sum p_1}{\sum p_0}$$

When such index numbers are constructed from the data of
Table 45 the results in Table 46 are secured.
The actual aggregates are given in column (2); to facilitate
comparison the same figures are reduced to relatives, with
the 1919 aggregate as base, in column (3).

The results secured by this method of constructing index 
numbers of prices will be compared shortly with results 
secured from the same data by other methods. The chief 
weakness of this type of index number is obvious. This is
not an unweighted nor yet an equally weighted index.
The influence of each commodity upon the result is dependent
upon the price of the unit in which it happens to be
traded. In the present index, hay, which is quoted by the
ton, is given more weight than all the other 11 commodities
combined, with flaxseed second in importance. The index
secured by adding the quotations is weighted in an entirely
illogical fashion and cannot be accepted as reflecting the
course of farm crop prices.

* The Making of Index Numbers, Houghton Mifflin Co., 1922.

Table 46

Index Numbers of Farm Crop Prices
(Aggregates of actual prices)

(1)      (2)                   (3)
Year     Index (aggregate      Index relative
         of actual prices)     (1919 = 100)

1919     $36.349               100
1920      26.790                74
1921      18.690                51
1922      19.913                55
1923      21.838                60
1924      23.142                64
1925      23.831                66
1926      22.499                62
1927      19.291                53
1928      19.584                54
1929      21.339                59
1930      18.290                50
1931      13.211                36
1932       9.503                26
1933      13.691                38
1934      20.723                57
1935      12.844                35
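The simple aggregate can be checked with a few lines of Python. The sketch below is illustrative only: it uses the 1919 and 1920 quotations of Table 47, and the function name `aggregate_index` is an assumption introduced here, not a term from the text.

```python
# December 1 farm prices in dollars per quoted unit, from Table 47.
prices_1919 = {
    "Corn": 1.343, "Cotton": 0.356, "Hay": 20.150, "Wheat": 2.131,
    "Oats": 0.702, "Potatoes": 1.580, "Sugar": 0.102, "Barley": 1.215,
    "Tobacco": 0.390, "Flaxseed": 4.383, "Rye": 1.331, "Rice": 2.666,
}
prices_1920 = {
    "Corn": 0.656, "Cotton": 0.139, "Hay": 17.780, "Wheat": 1.433,
    "Oats": 0.456, "Potatoes": 1.128, "Sugar": 0.053, "Barley": 0.716,
    "Tobacco": 0.212, "Flaxseed": 1.770, "Rye": 1.256, "Rice": 1.191,
}


def aggregate_index(base, given):
    """Sum of given-year prices as a percentage of the sum of base-year prices."""
    return 100 * sum(given.values()) / sum(base.values())


index_1920 = aggregate_index(prices_1919, prices_1920)

# Share of the 1919 aggregate contributed by hay alone (quoted by the ton).
hay_share = prices_1919["Hay"] / sum(prices_1919.values())

print(f"1920 index (1919 = 100): {index_1920:.1f}")
print(f"Hay share of 1919 aggregate: {hay_share:.1%}")
```

Hay alone supplies more than half of the 1919 aggregate, which is the accidental weighting described above.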

One method which has been employed for avoiding the 
unequal weighting caused by the difference in units in 
which different commodities are traded is to reduce all 
quotations to the same unit. Thus hay, rice, corn, cotton, 
and the other commodities might all be quoted by the 
pound, and these quotations added to secure the index. 
Yet this method, which has been employed in the con- 
struction of Bradstreet’s index, merely replaces one system 
of illogical weighting by an equally illogical one. Equal 
weight, if such is desired, is not given to all commodities 
by this method. Thus, in 1919 hay was worth $.010075
per pound, cotton $.356 per pound and rice $.059 per
pound, cotton having a weight in an aggregate of per pound
prices 6 times that of rice and 35 times that of hay.

ARITHMETIC AVERAGES OF RELATIVE PRICES

Another method employed in the construction of index 
numbers involves the reduction of each quoted price to a 
relative, with reference to the price of the same commodity 
at a certain basic date, these relative figures then being 
averaged by any of the conventional methods. The example 
in Table 47 illustrates the first phase of this process, data 
for two years being utilized. The year 1919 is taken as base. 

Table 47

Computation of Relative Prices for the Construction of Index Numbers

(1)            (2)          (3)           (4)        (5)           (6)
Commodity      Unit         Price, 1919   Relative   Price, 1920   Relative

Corn           Bu.          $ 1.343       100        $  .656        48.8
Cotton         Lb.             .356       100           .139        39.0
Hay            Ton (sh.)     20.150       100         17.780        88.2
Wheat          Bu.            2.131       100          1.433        67.2
Oats           Bu.             .702       100           .456        65.0
Wh. Potatoes   Bu.            1.580       100          1.128        71.4
Sugar          Lb.             .102       100           .053        52.0
Barley         Bu.            1.215       100           .716        58.9
Tobacco        Lb.             .390       100           .212        54.4
Flaxseed       Bu.            4.383       100          1.770        40.4
Rye            Bu.            1.331       100          1.256        94.4
Rice           Bu.            2.666       100          1.191        44.7

Total                                     1,200                    724.4


From these figures the arithmetic averages of relative
prices in these two years may be readily computed. The
formula for any single relative is $\frac{p_1}{p_0}$. When there are $N$
relatives the formula for the index number at time “1” is

$$\frac{\sum \frac{p_1}{p_0}}{N}$$

In the present case

Index (1919) = $\frac{1{,}200}{12} = 100$.

Index (1920) = $\frac{724.4}{12} = 60.4$.

Index numbers computed in this way for the years 1919 to 
1935, inclusive, are shown in column (3) of Table 50. 

This type of index number is usually termed an “un- 
weighted” index of relative prices. It is weighted, however, 
just as are the types illustrated in the two examples pre- 
ceding. The quantity employed as weight in each case is 
the amount of each commodity which would sell for $100
in the base year. In the preceding example the following 
quantities have been employed as weights: 


Corn          74.5 bu.
Cotton       280.9 lbs.
Hay            4.96 tons
Wheat         46.9 bu.
Oats         142.5 bu.
Potatoes      63.3 bu.
Sugar        980.4 lbs.
Barley        82.3 bu.
Tobacco      256.4 lbs.
Flaxseed      22.8 bu.
Rye           75.1 bu.
Rice          37.5 bu.


What has been done, in effect, in the computation of the 
simple average of relative prices has been to determine the 
aggregate amount for which the above quantities would sell 
in each of the seventeen years included. At 1919 prices each
of the above quantities would sell for $100, the aggregate 
value being $1,200; at 1920 prices the aggregate value of 
the above quantities was $724.40. These aggregates, di- 
vided by 12, give the index numbers shown in column (3), 
Table 50: 100 for 1919, 60 (60.4) for 1920, etc. Thus the 
“unweighted average of relative prices” is in fact a weighted 
aggregate of actual prices. It is equally weighted in the 
sense that the value of the quantity of each commodity
employed as weight was equal to $100 in the base year, 1919.*
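The equivalence just described may be verified numerically. The following Python sketch is an illustration under stated assumptions: the prices are those of Table 47, and the variable names (`basket`, `basket_index`, and so on) are introduced here for convenience only.

```python
# December 1 prices from Table 47, in the order corn, cotton, hay, wheat, oats,
# potatoes, sugar, barley, tobacco, flaxseed, rye, rice.
prices_1919 = [1.343, 0.356, 20.150, 2.131, 0.702, 1.580,
               0.102, 1.215, 0.390, 4.383, 1.331, 2.666]
prices_1920 = [0.656, 0.139, 17.780, 1.433, 0.456, 1.128,
               0.053, 0.716, 0.212, 1.770, 1.256, 1.191]

# Route 1: average the price relatives (percentage form).
relatives = [100 * p1 / p0 for p0, p1 in zip(prices_1919, prices_1920)]
mean_of_relatives = sum(relatives) / len(relatives)

# Route 2: fix the quantity of each commodity that sold for $100 in 1919,
# value that basket at 1920 prices, and divide the aggregate by 12.
basket = [100 / p0 for p0 in prices_1919]
basket_value_1920 = sum(q * p1 for q, p1 in zip(basket, prices_1920))
basket_index = basket_value_1920 / len(basket)

print(f"Mean of relatives:        {mean_of_relatives:.2f}")
print(f"Weighted aggregate route: {basket_index:.2f}")
```

Both routes give the same figure, about 60.4, because each term of the basket value is $100 multiplied by the corresponding price ratio.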

MEDIANS OF RELATIVE PRICES

The median rather than the arithmetic mean may be 
employed in securing the average of the relative prices for 
each year. When the relatives in column (6) of Table 47 
are arranged in order of magnitude the following distribution 
is secured:

39.0     58.9
40.4     65.0
44.7     67.2
48.8     71.4
52.0     88.2
54.4     94.4


The smallest relative price is 39.0, the greatest 94.4; 
the median value is 56.65. This median value is the index 
number for 1920. All the index numbers computed in this 
way from the medians of relative prices are presented in
column (4), Table 50. 
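As a brief check in Python (a sketch only; `relatives_1920` is a name introduced here for the column (6) figures of Table 47), the median of twelve items is the midpoint of the sixth and seventh values in the ordered list:

```python
import statistics

# Relative prices for 1920 (1919 = 100), column (6) of Table 47.
relatives_1920 = [48.8, 39.0, 88.2, 67.2, 65.0, 71.4,
                  52.0, 58.9, 54.4, 40.4, 94.4, 44.7]

ordered = sorted(relatives_1920)

# With an even number of items the median is the mean of the two middle values.
middle_pair = (ordered[5], ordered[6])
median_index = statistics.median(relatives_1920)

print(middle_pair)
print(round(median_index, 2))
```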


GEOMETRIC AVERAGES OF RELATIVE PRICES


The geometric averages of the relative prices for the
various years may now be computed and the results compared
with those secured in the preceding examples. A
single relative being represented by the symbol $\frac{p_1}{p_0}$, the
formula for the geometric mean of $N$ relatives is

$$M_g = \sqrt[N]{\frac{p_1'}{p_0'} \times \frac{p_1''}{p_0''} \times \frac{p_1'''}{p_0'''} \times \cdots}$$

A geometric mean is generally computed by the aid of
logarithms; in this case

$$\log M_g = \frac{\sum \log \frac{p_1}{p_0}}{N}$$



* Attention was called to this characteristic of the simple average of relative
prices by F. R. Macaulay, American Economic Review, Dec., 1915, 928.

The method of computation may be illustrated for the
years 1919 and 1920. The relative prices of the various
commodities are repeated from Table 47. 

Table 48

Computation of Geometric Averages of Relative Prices

(1)            (2)               (3)               (4)               (5)
Commodity      Relative price,   Logarithm of      Relative price,   Logarithm of
               1919              fig. in col. (2)  1920              fig. in col. (4)

Corn           100               2.0               48.8              1.68842
Cotton         100               2.0               39.0              1.59106
Hay            100               2.0               88.2              1.94547
Wheat          100               2.0               67.2              1.82737
Oats           100               2.0               65.0              1.81291
Wh. Potatoes   100               2.0               71.4              1.85370
Sugar          100               2.0               52.0              1.71600
Barley         100               2.0               58.9              1.77012
Tobacco        100               2.0               54.4              1.73560
Flaxseed       100               2.0               40.4              1.60638
Rye            100               2.0               94.4              1.97497
Rice           100               2.0               44.7              1.65031

Total                            24.0                                21.17231

Log $M_g$ (1919) = $\frac{24.0}{12} = 2$

$M_g$ = anti-logarithm of 2 = 100

Log $M_g$ (1920) = $\frac{21.17231}{12} = 1.76436$

$M_g$ = anti-logarithm of 1.76436 = 58.1.

This value, 58.1, is the index number for 1920. The 
results for all the years are summarized in column (5), 
Table 50. 
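The logarithmic computation of Table 48 may be sketched in Python as follows. The names `mean_log` and `geometric_index` are illustrative assumptions; `math.log10` supplies the common logarithms that the table takes from printed tables.

```python
import math

# Relative prices for 1920 (1919 = 100), column (4) of Table 48.
relatives_1920 = [48.8, 39.0, 88.2, 67.2, 65.0, 71.4,
                  52.0, 58.9, 54.4, 40.4, 94.4, 44.7]

# Average the common logarithms of the relatives ...
mean_log = sum(math.log10(r) for r in relatives_1920) / len(relatives_1920)

# ... and take the anti-logarithm to recover the geometric mean.
geometric_index = 10 ** mean_log

print(f"Mean logarithm: {mean_log:.5f}")
print(f"Geometric mean: {geometric_index:.1f}")
```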

HARMONIC AVERAGES OF RELATIVE PRICES

The characteristics of the harmonic average have been 
discussed in a preceding chapter. The reciprocal of the 
harmonic mean, it will be recalled, is the arithmetic mean 
of the reciprocals of the constituent measures. The con- 
stituent items, in the present case, are price relatives of the 
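form $\frac{p_1}{p_0}$. The reciprocal of such a relative is $\frac{p_0}{p_1}$. The
formula for the harmonic mean of $N$ price relatives is,
therefore,

$$\frac{1}{H} = \frac{\frac{p_0'}{p_1'} + \frac{p_0''}{p_1''} + \frac{p_0'''}{p_1'''} + \cdots}{N}$$

or

$$H = \frac{N}{\frac{p_0'}{p_1'} + \frac{p_0''}{p_1''} + \frac{p_0'''}{p_1'''} + \cdots}$$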

The method of computation is illustrated in Table 49. 


Table 49

Computation of Harmonic Averages of Relative Prices

(1)            (2)               (3)               (4)               (5)
Commodity      Relative price,   Reciprocal of     Relative price,   Reciprocal of
               1919              fig. in col. (2)  1920              fig. in col. (4)

Corn           100               .01               48.8              .02049180
Cotton         100               .01               39.0              .02564103
Hay            100               .01               88.2              .01133787
Wheat          100               .01               67.2              .01488095
Oats           100               .01               65.0              .01538462
Wh. Potatoes   100               .01               71.4              .01400560
Sugar          100               .01               52.0              .01923077
Barley         100               .01               58.9              .01697793
Tobacco        100               .01               54.4              .01838235
Flaxseed       100               .01               40.4              .02475248
Rye            100               .01               94.4              .01059322
Rice           100               .01               44.7              .02237136

Total                            .12                                 .21404998

$H$ (1919) = $\frac{12}{.12} = 100$

$H$ (1920) = $\frac{12}{.21404998} = 56.1$



The index numbers computed in this way for all the years
included in the study are shown in column (6), Table 50.
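The arithmetic of Table 49 can be reproduced with the short Python sketch below. The variable names are introduced here for illustration; `statistics.harmonic_mean` from the standard library is used only as a cross-check on the explicit reciprocal computation.

```python
import statistics

# Relative prices for 1920 (1919 = 100), column (4) of Table 49.
relatives_1920 = [48.8, 39.0, 88.2, 67.2, 65.0, 71.4,
                  52.0, 58.9, 54.4, 40.4, 94.4, 44.7]

# Sum the reciprocals of the relatives, then divide N by that sum.
reciprocal_sum = sum(1 / r for r in relatives_1920)
harmonic_index = len(relatives_1920) / reciprocal_sum

# Cross-check against the library routine.
library_value = statistics.harmonic_mean(relatives_1920)

print(f"Sum of reciprocals: {reciprocal_sum:.8f}")
print(f"Harmonic mean:      {harmonic_index:.1f}")
```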

In the construction of the five types of index numbers
explained above no attempt has been made to use a logical 
weighting system. All are termed “unweighted” averages, 
a term which is quite misleading. The first index con- 
structed, based on aggregates of actual prices, is a heavily 
weighted index number, though the weights are illogical. 
In the next four the quantities employed as weights are the 
amounts purchasable for $100 in 1919. The five results 
are brought together and compared in Table 50. In each case 
the index is given to the nearest whole number. These index 
numbers are plotted in Fig. 52. 

Comparison of Simple Index Numbers 

The four averages of relative prices agree much more 
closely with each other than with the index numbers based 
on aggregates. For reasons already suggested the latter is 
quite untrustworthy as a measure of price changes. Of the 
other index numbers, the arithmetic, geometric, and har- 
monic means show a consistent relationship, a fact which 
follows from the nature of the averages employed. Except 
in the base year the geometric mean is always less than 
the arithmetic and the harmonic is always less than the 
geometric, the amount of difference increasing as the dis- 
persion of prices becomes greater. The median, with only 
twelve items to be averaged, is somewhat unstable, and its 
relationship to the other averages is not always a consistent 
one.
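The consistent ordering of the three means noted above is the classical inequality among them. As a brief restatement in symbols (writing $A$ for the arithmetic mean, a label introduced here, with $M_g$ and $H$ as in the preceding sections), for any set of positive relatives $r_1, r_2, \ldots, r_N$:

$$H = \frac{N}{\sum \frac{1}{r_i}} \;\le\; M_g = \sqrt[N]{r_1 r_2 \cdots r_N} \;\le\; A = \frac{\sum r_i}{N}$$

Equality holds only when all the relatives are equal, as they are in the base year, when every relative is 100.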

How are we to choose among these varying results? No 
one of these “unweighted” index numbers is perfect, for 
weights which have crept in do not measure the relative 
importance of the various commodities included in the 
index numbers. But, neglecting for the moment the question 
of weights, is it possible to test the adequacy of the different 
methods of measuring changes in the prices as given? 

Table 50

Index Numbers of Farm Crop Prices, 1919-1935
(1919 = 100)

(1)     (2)              (3)            (4)          (5)            (6)
Year    Aggregates of    Arithmetic     Medians of   Geometric      Harmonic
        actual prices    averages of    relative     averages of    averages of
        (as relatives)   relative       prices       relative       relative
                         prices                      prices         prices

1919    100              100            100          100            100
1920     74               60             57           58             56
1921     51               44             42           43             42
1922     55               51             50           50             49
1923     60               55             50           54             53
1924     64               60             61           59             58
1925     66               59             53           57             55
1926     62               53             49           52             50
1927     53               53             55           52             52
1928     54               48             48           47             46
1929     59               54             53           53             52
1930     50               38             32           36             35
1931     36               27             27           27             26
1932     26               20             18           19             19
1933     38               35             33           34             34
1934     57               48             48           46             43
1935     35               35             36           35             34



Fig. 52. — Comparison of Five Simple Index Numbers of Farm Crop 
Prices, 1919-1935 (1919 = 100) 

THE TIME REVERSAL TEST

For this purpose Irving Fisher has employed what he 
terms the “time reversal test.” This is merely a test to 
determine whether a given method will work both ways in 
time, forward and backward. If from 1935 to 1936 sugar 
should increase from four to eight cents a pound, the price
in 1936 would be 200 per cent of the price in 1935, and the 
price in 1935 would be 50 per cent of the price in 1936. 
One figure is the reciprocal of the other; their product 
(2.00 X .50) is unity. Similarly, if a given method of index 
number construction shows the general price level in one 
year to be 200 per cent of the level in the preceding year, it 
should work correctly when reversed; it should show that 
the price level in the first year was 50 per cent of the price 
level in the second year. When the data for any two years 
are treated by the same method, but with the bases reversed, 
the two index numbers secured should be reciprocals of each 
other. Their product should always be unity. If it is not, 
there is an inherent bias in the method. 

This test may be applied to the methods employed above, 
using prices for 1919 and 1920. With 1919 as base the 
following results were obtained: 


Year    Aggregates of    Arithmetic     Medians of   Geometric      Harmonic
        actual prices    averages of    relative     averages of    averages of
        (as relatives)   relative       prices       relative       relative
                         prices                      prices         prices

1919    100              100            100          100            100
1920     73.70216         60.36666       56.65        58.1221        56.0617


and with 1920 as base: 


Year    Aggregates of    Arithmetic     Medians of   Geometric      Harmonic
        actual prices    averages of    relative     averages of    averages of
        (as relatives)   relative       prices       relative       relative
                         prices                      prices         prices

1919    135.68122        178.36666      176.85       172.04         165.6467
1920    100              100            100          100            100

When the index numbers for 1920 in the first table are
multiplied by the corresponding index numbers for 1919
in the second table, we have the following values. (In
securing these products the index numbers are put in the
ratio, not in the percentage form.)


Aggregates of    Arithmetic     Medians of   Geometric      Harmonic
actual prices    averages of    relative     averages of    averages of
                 relative       prices       relative       relative
                 prices                      prices         prices

1.00             1.0767         1.00         1.00           .9286


This time reversal test is met by three of the methods 
employed. It is not met by either the arithmetic or har- 
monic averages. The former has a distinct upward bias, 
amounting to more than seven per cent when the errors for
1919 and 1920 are compounded, while the harmonic mean 
shows almost as large an error in the opposite direction. 
Unless the inherent bias which is found in both these aver- 
ages is rectified in some way, methods based upon these 
averages should not be used in the construction of index 
numbers. 
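The test may be sketched in Python as below. This is an illustration under stated assumptions, not the procedure used for the tables: it works from the unrounded 1919 and 1920 prices of Table 47 rather than the rounded relatives, so the products may differ from the printed figures in the last decimal place, and the dictionary `methods` is a device introduced here. With an even number of items the median is the midpoint of two relatives, so its product is only approximately unity.

```python
import statistics

# December 1 prices from Table 47 (corn, cotton, hay, wheat, oats, potatoes,
# sugar, barley, tobacco, flaxseed, rye, rice).
p_1919 = [1.343, 0.356, 20.150, 2.131, 0.702, 1.580,
          0.102, 1.215, 0.390, 4.383, 1.331, 2.666]
p_1920 = [0.656, 0.139, 17.780, 1.433, 0.456, 1.128,
          0.053, 0.716, 0.212, 1.770, 1.256, 1.191]


def relatives(base, given):
    """Price relatives in ratio form (given-year price over base-year price)."""
    return [g / b for b, g in zip(base, given)]


# Each method maps (base prices, given prices) to an index in ratio form.
methods = {
    "aggregate": lambda b, g: sum(g) / sum(b),
    "arithmetic": lambda b, g: statistics.fmean(relatives(b, g)),
    "median": lambda b, g: statistics.median(relatives(b, g)),
    "geometric": lambda b, g: statistics.geometric_mean(relatives(b, g)),
    "harmonic": lambda b, g: statistics.harmonic_mean(relatives(b, g)),
}

# Forward index (1919 base) times backward index (1920 base).
# A product of unity means the method passes the time reversal test.
products = {name: f(p_1919, p_1920) * f(p_1920, p_1919)
            for name, f in methods.items()}

for name, product in products.items():
    print(f"{name:<10} {product:.4f}")
```

The arithmetic and harmonic products are exact reciprocals of one another, which is why their errors are of nearly equal size and opposite sign.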

The Weighting of Index Numbers

Five simple index numbers of prices have been described 
in the preceding section. With the introduction of weighting 
the number of possible combinations is greatly increased, 
but only a few of these types need concern us here. 

In the construction of an accurate measure of price changes 
logical weights must be employed, weights which truly reflect 
the relative importance of the commodities included. If the 
weighting problem is ignored haphazard and illogical weights 
will inevitably be present, whether recognized or not. 

The data used in the preceding examples may be utilized 
to illustrate methods of weighting and to show the effects 
of varying weights upon the values of index numbers. The 
weights employed in constructing index numbers of farm 
crop prices may be either the quantities or values of the 
crops produced, depending upon the type of index selected. 
The quantities produced during the period 1919-1935 are 
given in Table 51. 

WEIGHTED AGGREGATES OF ACTUAL PRICES 

The thoroughly illogical results obtained when actual 
prices, as quoted, are totaled to secure an index number 
have been pointed out. The same objection cannot be 
made when the prices are appropriately weighted before 
the aggregate is taken. If for weights we employ the quan- 
tities produced in the base year (at time “0”) the formula 
for the weighted aggregate is

$$\frac{\sum p_1 q_0}{\sum p_0 q_0}$$

This is, in effect, the method employed by the United States
Bureau of Labor Statistics, though the quantities are taken
from a year other than the base year. The formula for this
type of weighted aggregative index is known as Laspeyres’
formula. The method is illustrated in Table 52.

The desired index numbers, in the form of relatives, may 
be computed from the aggregates secured by totaling col- 
umns (5) and (8) of Table 52. Either year may be taken 
as the base, and the price aggregate in the other year ex- 
pressed as a relative on this base. With the 1919 aggregate 
as base the index for 1920 is 58.2. Index numbers similarly 
computed for the other years are given in column (2), 
Table 55. 

Another type of weighted aggregate may be constructed, 
with weights taken not from the base period but from the 
later period in the given comparison. That is, we may
employ $q_1$ (quantity at time “1”) as weight in comparing
prices at time “1” with prices at time “0,” and employ
$q_2$ (quantity at time “2”) as weight in comparing prices at
time “2” with prices at time “0.” Algebraically, the
formula for the index number at time “1” is

$$\frac{\sum p_1 q_1}{\sum p_0 q_1}$$

This is known as Paasche’s formula. The process of compu- 
tation is precisely the same as in the preceding example, 
except that the weights are changed with each successive 
year. The index numbers secured by this method are given 
in column (3), Table 55. 

The weights in these two cases have been quantities, for 
prices, multiplied by quantities, give aggregates in dollar 
values. But in weighting individual price relatives, quanti- 
ties will not serve. The abstract relatives must be weighted 
by values, if the resulting products are to be comparable. 
For values are in terms of a common dollar unit, while 
quantities may be expressed in a variety of units. The values 
which are to be employed as weights may be derived in 
various ways. 

Fisher ¹ outlines the four following methods, of which the
second and third are hybrid types:

I. Each weight = base year price × base year quantity ($p_0 q_0$).

II. Each weight = base year price × given year quantity ($p_0 q_1$).

III. Each weight = given year price × base year quantity ($p_1 q_0$).

IV. Each weight = given year price × given year quantity ($p_1 q_1$).

Just as certain averages possess inherent bias, so a distinc- 
tive weight bias arises from each type of value weighting. 
(This inherent bias is absent from the quantity weighting.) 
A downward bias arises from weighting systems I and II 
(in which base year prices are used), while an upward bias 
arises from weighting systems III and IV (using prices in 
the given year). This is in part capable of mathematical
demonstration ² and has in part been established by numerous
trials.

¹ Irving Fisher, The Making of Index Numbers, 54.

² An index weighted by type III must exceed an index weighted by type I.

(Footnote 2 continued on page 196.)

In the several examples next following we shall deal only
with values of quantities produced in the base year, 1919. 
These values are given in the third column of Table 53. For 
weighting purposes they are taken to the nearest million. 

WEIGHTED ARITHMETIC AVERAGES OF
RELATIVE PRICES 

In the computation of an index of this type, each relative 
is multiplied by the appropriate weight and the sum of the 
products is divided by the sum of the weights. The process 
is illustrated in Table 53. 

The index for 1920, it will be noted, is identical with that 
secured from the computations illustrated in Table 52. That 
index is a weighted aggregate of actual prices, the weights 
being the quantities produced in the base year. An arith- 
metic mean of relative prices, weighted by values in the base 
year, is always equal to a relative constructed from such an 
aggregate.¹
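The identity may be illustrated with a short Python sketch. Two assumptions are made here and should be kept in mind: the base-year quantities are not taken from Table 51 but are implied from the rounded values of Table 53 (value divided by 1919 price), and the relatives are computed from the unrounded prices of Table 47, so the result may differ slightly from the 58.2 printed in Table 53.

```python
# December 1 prices from Table 47 (corn, cotton, hay, wheat, oats, potatoes,
# sugar, barley, tobacco, flaxseed, rye, rice).
p_1919 = [1.343, 0.356, 20.150, 2.131, 0.702, 1.580,
          0.102, 1.215, 0.390, 4.383, 1.331, 2.666]
p_1920 = [0.656, 0.139, 17.780, 1.433, 0.456, 1.128,
          0.053, 0.716, 0.212, 1.770, 1.256, 1.191]

# Values of 1919 production in millions of dollars (the weights of Table 53).
value_1919 = [3598, 2031, 1543, 2029, 777, 470, 446, 159, 563, 30, 105, 114]

# Route 1: arithmetic mean of price relatives, weighted by base-year values.
relatives = [100 * p1 / p0 for p0, p1 in zip(p_1919, p_1920)]
weighted_mean = (sum(r * w for r, w in zip(relatives, value_1919))
                 / sum(value_1919))

# Route 2: aggregate of actual prices weighted by base-year quantities.
# Quantities here are implied (value / price), an assumption of this sketch.
q_1919 = [w / p0 for w, p0 in zip(value_1919, p_1919)]
weighted_aggregate = (100 * sum(p1 * q for p1, q in zip(p_1920, q_1919))
                      / sum(p0 * q for p0, q in zip(p_1919, q_1919)))

print(f"Weighted mean of relatives: {weighted_mean:.2f}")
print(f"Weighted aggregate:         {weighted_aggregate:.2f}")
```

The two figures agree because each weighted relative, $\frac{p_1}{p_0} \times p_0 q_0$, reduces to $p_1 q_0$.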

(Footnote 2 continued from page 195.)

Weighting the price relative of a given commodity by type III, we have

$$\frac{p_1}{p_0} \times p_1 q_0$$

while by type I we have

$$\frac{p_1}{p_0} \times p_0 q_0.$$

If $p_1$ exceeds $p_0$ (if the price relative is above 100) the weight by type III
($p_1 q_0$) is greater than the weight by type I ($p_0 q_0$). That is, all relatives above
100 are more heavily weighted by type III than by type I. But if $p_1$ is less than
$p_0$ the weight by type III ($p_1 q_0$) is less than the weight by type I ($p_0 q_0$). All
relatives below 100 are less heavily weighted by type III than by type I. Thus
the effect of all price increases is over-emphasized and the effect of all price
declines is under-emphasized by type III, giving a net result always greater
than type I. The same is true of type IV as compared with type II. As between
types I and IV there is no necessary relation, but in general an index
weighted by type IV will exceed an index weighted by type I. Base year
weighting involves a downward bias while given year weighting involves an
upward bias. (For a more detailed discussion of bias in weighting see Fisher,
The Making of Index Numbers, Chapter V and pages 384-387.)

¹ This may be readily demonstrated algebraically. The value of any commodity
in the base year is $p_0 q_0$, while the price relative for a second year is $\frac{p_1}{p_0}$.

(Footnote 1 continued on page 197.)

Table 53

Computation of Weighted Arithmetic Averages of Relative Prices

Commodity   Relative   Weight     Relative      Relative   Weight     Relative
            price,                price         price,                price
            1919                  × weight      1920                  × weight

Corn        100        $ 3,598    $  359,800    48.8       $ 3,598    $175,582.4
Cotton      100          2,031       203,100    39.0         2,031      79,209.0
Hay         100          1,543       154,300    88.2         1,543     136,092.6
Wheat       100          2,029       202,900    67.2         2,029     136,348.8
Oats        100            777        77,700    65.0           777      50,505.0
Potatoes    100            470        47,000    71.4           470      33,558.0
Sugar       100            446        44,600    52.0           446      23,192.0
Barley      100            159        15,900    58.9           159       9,365.1
Tobacco     100            563        56,300    54.4           563      30,627.2
Flaxseed    100             30         3,000    40.4            30       1,212.0
Rye         100            105        10,500    94.4           105       9,912.0
Rice        100            114        11,400    44.7           114       5,095.8

Total                  $11,865    $1,186,500               $11,865    $690,699.9

(The weights employed are the values of the quantities produced in
1919, in millions.)

Weighted arithmetic mean (1919) = $\frac{\$1{,}186{,}500}{\$11{,}865} = 100$

Weighted arithmetic mean (1920) = $\frac{\$690{,}699.9}{\$11{,}865} = 58.2$


(Footnote 1 continued from page 196.)

The weighted mean of such price relatives is equal to

$$\frac{\frac{p_1'}{p_0'} \times p_0' q_0' + \frac{p_1''}{p_0''} \times p_0'' q_0'' + \frac{p_1'''}{p_0'''} \times p_0''' q_0''' + \cdots}{p_0' q_0' + p_0'' q_0'' + p_0''' q_0''' + \cdots}$$

which reduces to

$$\frac{\sum p_1 q_0}{\sum p_0 q_0},$$

a weighted aggregate of the type mentioned.

In the same way the harmonic mean, weighted by full values in the second
year, reduces to

$$\frac{\sum p_1 q_1}{\sum p_0 q_1}.$$

This has already been encountered as an aggregate of actual prices weighted
by quantities in the second year.

WEIGHTED GEOMETRIC AVERAGES OF
RELATIVE PRICES 

The process of computing the weighted geometric mean 
is identical with that of computing the unweighted geometric 
mean, except that the logarithm of each relative is multi- 
plied by the given weight and the sum of these weighted 
logarithms is divided by the sum of the weights, the result 
being the logarithm of the desired index.¹ The method is
illustrated in Table 54. 


Table 54

Computation of Weighted Geometric Average of Relative Prices, 1920
(1919 = 100)

Commodity        Relative      Logarithm of     Weight    Logarithm of
                 price, 1920   relative price             relative price
                                                          × weight

Corn             48.8          1.68842           3,598     6074.93516
Cotton           39.0          1.59106           2,031     3231.44286
Hay              88.2          1.94547           1,543     3001.86021
Wheat            67.2          1.82737           2,029     3707.73373
Oats             65.0          1.81291             777     1408.63107
Potatoes, Wh.    71.4          1.85370             470      871.23900
Sugar            52.0          1.71600             446      765.33600
Barley           58.9          1.77012             159      281.44908
Tobacco          54.4          1.73560             563      977.14280
Flaxseed         40.4          1.60638              30       48.19140
Rye              94.4          1.97497             105      207.37185
Rice             44.7          1.65031             114      188.13534

Total                                           11,865    20,763.46850

$M_g$ = 56.2


The index for 1920 on the 1919 base is 56.2. Measure- 
ments secured for all the years of the period covered are 
given in column (5), Table 55, together with the other 
weighted index numbers already explained.
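The computation of Table 54 may be sketched in Python as follows. The names `log_index` and `weighted_geometric` are illustrative assumptions; the relatives and weights are those printed in the table.

```python
import math

# Relative prices for 1920 (1919 = 100) and base-year value weights
# (millions of dollars), as printed in Table 54.
relatives_1920 = [48.8, 39.0, 88.2, 67.2, 65.0, 71.4,
                  52.0, 58.9, 54.4, 40.4, 94.4, 44.7]
weights = [3598, 2031, 1543, 2029, 777, 470, 446, 159, 563, 30, 105, 114]

# Multiply each logarithm by its weight and sum the products.
weighted_log_sum = sum(w * math.log10(r)
                       for r, w in zip(relatives_1920, weights))

# Divide by the sum of the weights to get the logarithm of the index.
log_index = weighted_log_sum / sum(weights)
weighted_geometric = 10 ** log_index

print(f"Sum of weighted logarithms: {weighted_log_sum:.2f}")
print(f"Logarithm of index:         {log_index:.5f}")
print(f"Weighted geometric mean:    {weighted_geometric:.1f}")
```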

¹ The formula for the weighted geometric mean is given in Chapter IV.

How are we to judge of the relative merits of these three
index numbers? We may, first, apply the time reversal 
test which was employed in comparing the five simple index 
numbers. This test is not met by any of the weighted types 
we have constructed. The geometric is equally at fault 
with the others. Though the simple geometric meets the 
test, the introduction of weighting imparts a bias to the 
result. Judged by that test alone none of the three is sat- 
isfactory. We may next try the second fundamental test 
that Fisher has developed, which is termed the “factor 
reversal test.” 


THE FACTOR REVERSAL TEST 


The total value of a given commodity in a given year is, 
of course, the product of the quantity produced and the 
price per unit; algebraically, it is equal to $p'q'$. The ratio
of the total value in one year to the total value in the preceding
year is $\frac{p_1' q_1'}{p_0' q_0'}$. If, from one year to the next, both price
and quantity should double, the price relative would be 200,
the quantity relative 200, and the value relative 400. The 
total value in the second year would be four times the value 
in the first year. The value relative would be equal to the 
product of the price and quantity relatives, a relationship 
which is obvious in the case of a single commodity. 

If, for a number of commodities, we construct an index 
of the price change from one year to the next and an index 
of the quantity change from one year to the next, we should 
expect their product to be equal to the ratio of the total
values in the second year to the total values in the first 
year. If the product is not equal to the value ratio, there 
is an error in one or both of the index numbers. 

As an illustration, we may apply this test to the first
aggregative index constructed, $\frac{\sum p_1 q_0}{\sum p_0 q_0}$. An index of quantities
may be computed from this same formula, merely
interchanging the $q$’s and the $p$’s; the formula becomes

$$\frac{\sum q_1 p_0}{\sum q_0 p_0}$$

The same price factor appears in numerator and denominator,
as we desire to measure only the effect of the quantity
change. Substituting the given values of the twelve
farm crops we have

Quantity index, 1920 (1919 = 100) = $\frac{\$12{,}998{,}610{,}800}{\$11{,}864{,}461{,}250} = 1.0956$

In percentage form the index of quantities produced in
1920 is 109.56, with 1919 as base. The corresponding price
index, by the same formula, is 58.24. The product

1.0956 × .5824 = .6381.

That is, if prices have decreased 41.76 per cent, while
quantities have increased 9.56 per cent, the total value
should show a decrease of 36.19 per cent.
For the value ratio we have

$$\frac{\sum p_1 q_1}{\sum p_0 q_0} = \frac{\$7{,}441{,}317{,}450}{\$11{,}864{,}461{,}250} = .6272.$$

There is a discrepancy here of about one per cent. The
actual error is not great, but the formula definitely fails to
meet the test. Applying the same test to the aggregative index
weighted by given year quantities, we secure the following results:

Price index = $\frac{\sum p_1 q_1}{\sum p_0 q_1}$ = 57.25

Quantity index = $\frac{\sum q_1 p_1}{\sum q_0 p_1}$ = 107.69

Product = .5725 × 1.0769 = .6165

(In securing the product the index numbers are put in
the ratio, not in the percentage form.)

Here is an error of the same magnitude in the other direction.
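The arithmetic of this test can be followed in a short Python sketch. It is illustrative only and rests on stated assumptions: the three dollar aggregates are those quoted above, while the base-weighted price index (.5824) and the given-year-weighted quantity index (1.0769) are taken as printed, since the aggregate $\sum p_1 q_0$ from which they derive is not quoted in this passage. The variable names are introduced here.

```python
# Aggregates quoted in the text, in dollars.
sum_p0_q0 = 11_864_461_250   # value of 1919 crops at 1919 prices
sum_p0_q1 = 12_998_610_800   # value of 1920 crops at 1919 prices
sum_p1_q1 = 7_441_317_450    # value of 1920 crops at 1920 prices

# Indexes that follow directly from those aggregates (ratio form).
quantity_index_base_weighted = sum_p0_q1 / sum_p0_q0
price_index_given_weighted = sum_p1_q1 / sum_p0_q1
value_ratio = sum_p1_q1 / sum_p0_q0

# Indexes taken as printed (assumption: not recomputed here).
price_index_base_weighted = 0.5824
quantity_index_given_weighted = 1.0769

# Factor reversal test: price index times quantity index, same formula.
product_base_weighted = price_index_base_weighted * quantity_index_base_weighted
product_given_weighted = price_index_given_weighted * quantity_index_given_weighted

print(f"Value ratio:                    {value_ratio:.4f}")
print(f"Product, base year weights:     {product_base_weighted:.4f}")
print(f"Product, given year weights:    {product_given_weighted:.4f}")
```

The value ratio lies between the two products, one formula overstating it and the other understating it by about one point.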

The weighted geometric average also fails to meet this
fundamental factor reversal test. With respect to both the 
geometric index and the aggregates we have, apparently, 
by the introduction of weights spoiled index numbers which 
in their simple form were unbiased. Yet weights we must 
have, if the index numbers are to represent the facts ac- 
curately. Neither a simple index nor a weighted form of a 
simple index will meet the two tests laid down as funda- 
mental. Professor Fisher tested 46 such formulas, of which 
only four (the simple geometric, median, mode, and ag- 
gregative) met the time reversal test, and none met the 
factor reversal test. 


THE “IDEAL” INDEX


A way out of this difficulty is offered by the possibility 
of “rectifying” formulas in a crossing process, by averaging 
geometrically formulas which err in opposite directions. 
Professor Fisher has made exhaustive trials of all possible 
formulas by this process, finding thirteen formulas in all 
which met both tests. Of these he has selected one as 
“ideal,” from the viewpoint of both accuracy and simplicity 
of calculation. This ideal index is the geometric mean of the 
two aggregative types illustrated above. Its formula ¹ is

$$\sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}}$$

This index may be computed readily, in the present
instance, from the results already obtained. Thus for 1920
we have

Ideal index = $\sqrt{.5824 \times .5725}$ = .5774.

In the customary percentage form this is 57.74.

This index number meets both the time reversal and the 
factor reversal test. Applying the former: 

Index of prices, 1920 (1919 = 100) = 57.74
Index of prices, 1919 (1920 = 100) = 173.18
.5774 × 1.7318 = 1.00.

¹ The same formula was developed independently by Bowley, Pigou, Walsh,
and Young. See The Making of Index Numbers, xv, 240-242.

For the factor reversal test, applied to the data for 1920 
(with 1919 as base), we have 


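Index of prices = $\sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}}$ = .5774

Index of quantities = $\sqrt{\frac{\sum q_1 p_0}{\sum q_0 p_0} \times \frac{\sum q_1 p_1}{\sum q_0 p_1}}$ = 1.0862

Value ratio = $\frac{\sum p_1 q_1}{\sum p_0 q_0}$ = .6272.

Product of price and quantity indices = .5774 × 1.0862 = .6272.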
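
The two tests just applied can be sketched in a few lines of Python. This is a minimal illustration only: the three commodities, their prices and their quantities are hypothetical (they are not the farm crop data), and the function names are chosen here for convenience.

```python
from math import sqrt, isclose

def aggregate(prices, quantities):
    """Sum of price x quantity over all commodities."""
    return sum(p * q for p, q in zip(prices, quantities))

def laspeyres(p0, p1, q0):
    """Aggregative index weighted by base year quantities."""
    return aggregate(p1, q0) / aggregate(p0, q0)

def paasche(p0, p1, q1):
    """Aggregative index weighted by given year quantities."""
    return aggregate(p1, q1) / aggregate(p0, q1)

def ideal(p0, p1, q0, q1):
    """Geometric mean of the two weighted aggregative indexes."""
    return sqrt(laspeyres(p0, p1, q0) * paasche(p0, p1, q1))

# Hypothetical prices and quantities for three commodities.
p0, q0 = [2.00, 1.50, 0.30], [100.0, 250.0, 900.0]
p1, q1 = [1.10, 0.90, 0.20], [120.0, 240.0, 1000.0]

price_index = ideal(p0, p1, q0, q1)
# Factor reversal: interchange the roles of prices and quantities.
quantity_index = ideal(q0, q1, p0, p1)
# Time reversal: interchange the two periods.
backward_index = ideal(p1, p0, q1, q0)
value_ratio = aggregate(p1, q1) / aggregate(p0, q0)

print("forward x backward:", price_index * backward_index)
print("price x quantity  :", price_index * quantity_index)
print("value ratio       :", value_ratio)
```

With these made-up figures the forward and backward indexes multiply to unity and the price and quantity indexes multiply to the value ratio, apart from floating-point rounding. The aggregate weighted by base year quantities alone, treated in the same way, does not reproduce the value ratio.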

The ideal index, the two weighted aggregates that enter
into its construction and the geometric mean weighted by
values in the base year are given in Table 55 for the years
1919 to 1935. The index numbers are plotted in Fig. 53.
The wide discrepancies that were found between the vari-
ous simple index numbers do not appear when the weighted
indices are compared. There are significant differences, but
there is none of the erratic behavior of some of the simpler
forms.

[Fig. 53. Comparison of Four Weighted Index Numbers of Farm Crop
Prices, 1919-1935 (1919 = 100). Series plotted: aggregative (weighted by
base year quantities); aggregative (weighted by given year quantities);
ideal index; weighted geometric average (weighted by base year quantities).]

Table 55

Comparison of Weighted Index Numbers of Farm Crop Prices, 1919-1935

(1)     (2)                  (3)                  (4)                   (5)
Year    Aggregative          Aggregative          Ideal index           Weighted
        (weighted by         (weighted by         (geometric mean       geometric average
        base year            given year           of indices in         (weighted by
        quantities)          quantities)          cols. (2) and (3))    base year
                                                                        quantities)
        $\frac{\sum p_1 q_0}{\sum p_0 q_0}$    $\frac{\sum p_1 q_1}{\sum p_0 q_1}$

1919    100.0                100.0                100.0                 100.0
1920     58.2                 57.2                 57.7                  56.2
1921     42.8                 42.0                 42.4                  41.5
1922     53.6                 53.1                 53.4                  52.9
1923     59.8                 59.7                 59.8                  58.1
1924     65.0                 64.3                 64.6                  64.4
1925     57.9                 56.3                 57.1                  56.5
1926     51.4                 49.2                 50.3                  49.6
1927     54.5                 54.3                 54.4                  54.3
1928     51.8                 51.1                 51.4                  51.2
1929     54.1                 53.3                 53.7                  53.4
1930     41.3                 39.6                 40.4                  39.4
1931     26.6                 25.5                 26.0                  25.3
1932     19.0                 18.9                 19.0                  18.1
1933     32.6                 32.2                 32.4                  32.2
1934     51.6                 52.1                 51.8                  49.5
1935     38.1                 37.6                 37.9                  37.8
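
Column (4) of Table 55 is described as the geometric mean of columns (2) and (3), and that relation can be checked directly. The sketch below transcribes the three columns as printed (reading the 1924 entry "64 6" as 64.6) and recomputes the ideal index; the variable names are illustrative only. Because the printed columns are rounded to one decimal place, the recomputed values can differ slightly from the printed ones.

```python
from math import sqrt

# Columns (2), (3) and (4) of Table 55, 1919 through 1935, as printed.
years = list(range(1919, 1936))
base_weighted = [100.0, 58.2, 42.8, 53.6, 59.8, 65.0, 57.9, 51.4, 54.5,
                 51.8, 54.1, 41.3, 26.6, 19.0, 32.6, 51.6, 38.1]
given_weighted = [100.0, 57.2, 42.0, 53.1, 59.7, 64.3, 56.3, 49.2, 54.3,
                  51.1, 53.3, 39.6, 25.5, 18.9, 32.2, 52.1, 37.6]
ideal_printed = [100.0, 57.7, 42.4, 53.4, 59.8, 64.6, 57.1, 50.3, 54.4,
                 51.4, 53.7, 40.4, 26.0, 19.0, 32.4, 51.8, 37.9]

# Geometric mean of the two aggregative indexes, year by year.
ideal_recomputed = [sqrt(a * b) for a, b in zip(base_weighted, given_weighted)]

# Largest disagreement between the recomputed and the printed column.
largest_gap = max(abs(r - p) for r, p in zip(ideal_recomputed, ideal_printed))

for year, value, printed in zip(years, ideal_recomputed, ideal_printed):
    print(year, round(value, 2), printed)
print("largest gap:", round(largest_gap, 3))
```

The recomputed column agrees with the printed one to within a tenth of a point in every year, which is what rounding of the inputs would lead one to expect.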

Of these four types the ideal index probably serves 
as the best measure of the average price change between 
1919 and each of the given years.¹ It is designed, it should
be remembered, to measure the change between two stated 
times, and not for intermediate comparison. The value of 
the index for 1933, for instance, is determined by the rela-
tion between prices and quantities in 1919 and in 1933.
There is double weighting and the weights vary from year
to year. If 1933 is to be compared with 1932 a new index
is needed, in which the prices and quantities for 1933 and
1932 alone are included. Direct comparison on the basis
of the values for the ideal index given in the above table
is liable to error, because of the weighting system employed.

¹ The year 1919, which is here employed as base, is not a satisfactory stand-
ard of reference for economic purposes. It was a disturbed year, marking a
transition from war-time to peace-time conditions.

It is one of the merits of the geometric mean with constant 
weights that it permits the index for each year to be com- 
pared directly not only with the base year index, but with 
the index for any other year. The base may be shifted 
directly from the relatives, and the same result will be 
secured as if the computation were made from the original 
data. If this same system be followed with the ideal index 
no large errors may be expected, but strict accuracy will 
not be secured.²
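
The contrast drawn here between the geometric mean with constant weights and the ideal index can be illustrated numerically. The following is a sketch under stated assumptions: the prices, quantities and weights for three years are hypothetical, and the helper names are chosen for this example. It shifts the base from year 0 to year 1 by simple division and compares the result with an index computed directly on year 1 as base.

```python
from math import exp, log, sqrt, isclose

def aggregate(prices, quantities):
    """Sum of price x quantity over all commodities."""
    return sum(p * q for p, q in zip(prices, quantities))

def weighted_geometric(pa, pb, weights):
    """Geometric mean of the price relatives pb/pa with constant weights."""
    total = sum(weights)
    return exp(sum(w * log(b / a) for a, b, w in zip(pa, pb, weights)) / total)

def ideal(pa, pb, qa, qb):
    """Ideal index of period b on period a as base."""
    return sqrt(aggregate(pb, qa) / aggregate(pa, qa)
                * aggregate(pb, qb) / aggregate(pa, qb))

# Hypothetical data: three commodities in years 0, 1 and 2.
prices = {0: [2.00, 1.50, 0.30], 1: [1.10, 0.90, 0.20], 2: [1.40, 0.70, 0.25]}
quantities = {0: [100.0, 250.0, 900.0], 1: [120.0, 240.0, 1000.0],
              2: [90.0, 300.0, 950.0]}
weights = [200.0, 375.0, 270.0]  # constant weights (year 0 values, assumed)

# Geometric mean: base shifted by division versus direct computation.
geo_shifted = (weighted_geometric(prices[0], prices[2], weights)
               / weighted_geometric(prices[0], prices[1], weights))
geo_direct = weighted_geometric(prices[1], prices[2], weights)

# Ideal index: the same two routes.
ideal_shifted = (ideal(prices[0], prices[2], quantities[0], quantities[2])
                 / ideal(prices[0], prices[1], quantities[0], quantities[1]))
ideal_direct = ideal(prices[1], prices[2], quantities[1], quantities[2])

print("geometric, shifted vs direct:", geo_shifted, geo_direct)
print("ideal, shifted vs direct    :", ideal_shifted, ideal_direct)
```

For the geometric mean the two routes agree apart from floating-point rounding. For the ideal index, in this made-up example, they differ by somewhat less than one point in a hundred: a small error, but not strict accuracy.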

SOME ALTERNATIVE TYPES

The chief obstacles in the way of general adoption of the 
ideal index arise from the difficulty of obtaining annual or 
monthly quantities to use as weights, and from the time 
involved in its computation. Where accuracy is essential 
the latter is not a serious difficulty. As a substitute formula 
which is much more quickly calculated Fisher has proposed 

$\frac{\sum (q_0 + q_1) p_1}{\sum (q_0 + q_1) p_0}$

This formula, which has also been recommended by Edge- 
worth and Marshall, is considered by Fisher to be “the 
best practical all-around formula, taking all four points 
into account — accuracy, speed, minimum legitimate cir- 
cular discrepancy, simplicity.” Results from this formula 
will generally differ from those secured from the ideal formula
by less than one fourth of one per cent. Table 56
on page 205 illustrates the method of computation, data for
1919 and 1920 being employed.

² If year to year comparison be a primary aim in a given instance, the ideal
index may be constructed on the chain system. Link index numbers are first
constructed, each year serving as base for the computation of the index for the
succeeding year. These links may then be “chained” with reference to a fixed
base. Warren M. Persons has shown that the errors involved in following this
method are cumulative, and may be serious if the links are chained for a number

[Table 56. Computation of Aggregative Index Weighted by Combined Quantities]
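
The aggregative index weighted by combined quantities can be sketched as follows and set beside the ideal index. The data are hypothetical (three invented commodities, not the 1919 and 1920 crop figures), and the function names are illustrative.

```python
from math import sqrt

def aggregate(prices, quantities):
    """Sum of price x quantity over all commodities."""
    return sum(p * q for p, q in zip(prices, quantities))

def combined_quantity_index(p0, p1, q0, q1):
    """Aggregative index weighted by the sum of base and given year quantities."""
    combined = [a + b for a, b in zip(q0, q1)]
    return aggregate(p1, combined) / aggregate(p0, combined)

def ideal(p0, p1, q0, q1):
    """Geometric mean of the base-weighted and given-weighted aggregates."""
    return sqrt(aggregate(p1, q0) / aggregate(p0, q0)
                * aggregate(p1, q1) / aggregate(p0, q1))

# Hypothetical prices and quantities for three commodities.
p0, q0 = [2.00, 1.50, 0.30], [100.0, 250.0, 900.0]
p1, q1 = [1.10, 0.90, 0.20], [120.0, 240.0, 1000.0]

combined = combined_quantity_index(p0, p1, q0, q1)
ideal_value = ideal(p0, p1, q0, q1)

print("combined quantities:", round(100 * combined, 2))
print("ideal index        :", round(100 * ideal_value, 2))
print("difference, per cent of ideal:",
      round(100 * abs(combined / ideal_value - 1), 4))
```

The combined-quantity form needs one pair of aggregates and no square root, which is the saving in labor referred to above. In this made-up example the two results differ by far less than one fourth of one per cent.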

This formula requires the same data as the ideal index, 
and these are not generally to be had. Usually it is only 
possible to secure comprehensive quantity figures at each 
census period, and for the intervening years constant weights 
must be employed. In such cases the weighted aggregative 

$\frac{\sum p_1 q_0}{\sum p_0 q_0}$

is probably the most generally useful type. The weighted 
geometric has many virtues, but is subject to a definite 
weighting bias. If no weights can be secured, or even ap- 
proximated, the simple geometric and the simple median 
are far better than any of the other simple types. The 
geometric mean is more generally useful than the median. 

An index number of prices is always based upon the study 
of a sample, the result being taken as representative of the 
entire field of prices from which the particular sample was 
drawn. Some method is needed, therefore, by which we may 
judge of the reliability of the different types of index num- 
bers, of their probable stability when computed from a 
number of successive samples. Some differences might be 
expected between index numbers based upon different sam- 
ples. With which type of index number would these differ-
ences due to fluctuations of sampling be least?¹

Truman L. Kelley² has attempted to measure the prob-
able errors of the chief types of index numbers and has
graded these types on the basis of excellence in this respect. 
Two index numbers, the weighted geometric mean and the 
weighted median, are given the highest grade, as being the 
most reliable, the least affected by fluctuations of sampling.
Fisher’s ideal index is ranked somewhat lower, though above
the weighted arithmetic and harmonic averages of price
relatives. The simple unweighted arithmetic average of
relatives is given the lowest rating in the list.

¹ The subject of sampling, in relation to the reliability of statistical measures,
is discussed in greater detail below.

² Truman L. Kelley, Statistical Method, New York, Macmillan, 1921.

For reliability, flexibility, and general excellence Kelley 
selects the weighted geometric mean as the best type of 
price index number. A ratio of aggregates 

$\frac{\sum p_1 w}{\sum p_0 w}$

with selected weights (not necessarily precisely equal to the 
quantities marketed or consumed) is given a total score, 
based on the essential requirements of a good index number, 
as high as that of the weighted geometric mean and higher 
than that of the ideal index. Weights other than actual 
quantities are used in order that there may be flexibility 
in the matter of weighting. 

The detailed discussion of procedures in the preceding 
pages has clearly shown that there are some definitely faulty 
formulas, obviously unsuited for use in the construction of 
index numbers serving ordinary purposes. Among the better 
formulas there are some differences in respect of liability 
to bias and character of data needed, and some variations in 
sampling reliability. The maker of index numbers will have 
these in mind in choosing a formula to employ under given 
conditions. A more important factor in his choice, however, 
will be the purpose to be served by the index number, the 
question it is designed to answer. A weighted aggregate of 
actual prices answers one question definitively. It gives, 
without equivocation, the aggregate cost of a fixed bill of 
goods at one period, in relation to the cost of the same bill 
of goods at another. A geometric mean of relative prices 
answers another question. It measures with accuracy the
average ratio of the prices of given commodities at one period 
to corresponding prices at another period. Some questions 
(for example, that answered by an unweighted arithmetic 
average of relative prices) have little if any economic sig-
nificance. It is because one or two main questions have
bulked large in economic discussion that emphasis has been
placed upon the finding of a “best” type of index number. Yet
the terms “best” and “ideal” are unfortunate, for they imply
that some absolute standard exists, with reference to which 
all formulas may be tested. No such absolute criterion may 
be applied to the diversity of research problems that call 
for the construction of index numbers. On the basis of his 
knowledge of the characteristics of different formulas, the 
discriminating investigator will choose technical methods 
adapted to his data and appropriate to his purposes. 

Other Problems Involved in the Construction
of Price Index Numbers

The preceding section has dealt with the technical prob- 
lems connected with the averaging of a given set of data 
in order to secure an index number of price variations. 
Certain methods have been shown to be quite faulty, while
certain others have been found to be appropriate for given
purposes. One who would use index numbers with intelli- 
gence should understand fully the methods which have 
been employed in securing given results, in order that he 
may know precisely what the given figure is designed to 
measure and what degree of reliability attaches to it. 

Such problems as these are not the only ones which 
confront those who construct index numbers, nor are these 
considerations the only ones which users of index numbers 
should bear in mind. Of equal importance with problems 
of averaging and weighting are the practical questions con- 
nected with the selection of representative samples. The 
only completely accurate measure of the general level of 
commodity prices would be secured by determining the ratio 
between all money units, including credit, in circulation 
(account being taken of velocity of circulation) and all the 
physical units of goods exchanged for money over a given 
period. The measurement of general price changes between
two periods would thus involve complete knowledge of these
two factors for each of the two periods. Such knowledge,
of course, cannot be had, so recourse must be had to the
method of sampling. And primary importance attaches to 
the number of commodities and the character of the com- 
modities upon the prices of which a given index number is 
based. 


NUMBER OF COMMODITIES TO BE INCLUDED

Here again we are confronted with a relation that has 
already been mentioned, the relation between methods and 
uses. Decision as to the number of commodities and the 
kinds of commodities to be included in a given case must 
rest upon the purpose for which the index is to be con- 
structed. Assuming that the index number is to serve as a 
measure of general changes in the price level, the ques- 
tion as to the number of commodities to be included may 
be easily answered — the larger the sample the more rep- 
resentative will be the results. The frequency polygon 
based upon a large sample will approach more closely to the 
ideal curve which would represent all price quotations than 
will that based upon a small sample. Thus, as a measure 
of general price changes, more confidence may be placed 
in the Bureau of Labor Statistics index, which is based 
upon 813 price quotations, than in Bradstreet’s, which was 
based upon 96 quotations, though the latter had particular 
virtues of its own.¹ Yet index numbers based upon a small
number of quotations may not be ruled out as without 
value. Wesley C. Mitchell, whose researches have ma- 
terially increased our knowledge of the price system and of 
the characteristics of index numbers, has compared in detail 
index numbers based upon varying numbers of quotations. 
Unexpected similarities are found. Those constructed from 
a limited number of quotations reflect the broad movements 
of prices in much the same way as do those based upon the
prices of several hundred commodities. In important details
there are differences, however, differences which may in-
volve doubt as to the movement of prices in a given year.
In such cases the index numbers based upon many quota-
tions must be accepted as more accurate measures of general
price movements, provided that the commodities included
be equally representative of the various elements of the price
system.

¹ Bradstreet’s index was discontinued at the end of the year 1937.
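
The bearing of sample size on stability can be illustrated by a small simulation. Everything in it is an assumption made for the sake of the sketch: the population of 5,000 price relatives is artificial and lognormally distributed, the index is a simple geometric mean of relatives, and the sample sizes 96 and 813 are borrowed from the text merely as convenient figures. It does not reproduce the construction of either index named above.

```python
import random
from math import exp, log
from statistics import pstdev

random.seed(0)  # fixed seed so that the run is repeatable

# Artificial population of 5,000 price relatives (lognormal, an assumption).
population = [random.lognormvariate(0.0, 0.3) for _ in range(5000)]

def geometric_index(relatives):
    """Simple geometric mean of price relatives, in percentage form."""
    return 100 * exp(sum(log(r) for r in relatives) / len(relatives))

def spread_of_index(sample_size, trials=300):
    """Standard deviation of the index over repeated random samples."""
    results = [geometric_index(random.sample(population, sample_size))
               for _ in range(trials)]
    return pstdev(results)

spread_small = spread_of_index(96)
spread_large = spread_of_index(813)

print("spread with  96 quotations:", round(spread_small, 2))
print("spread with 813 quotations:", round(spread_large, 2))
```

The index built on the larger sample varies less from one sample to another. The simulation says nothing about the choice of commodities, which matters as well.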

For other purposes, however, index numbers based upon 
a limited number of quotations may be preferable. This 
is particularly true when a “sensitive” index is desired, one
that will serve as a forecaster of general price movements 
rather than as a precise measure of changes in the general 
price level. Of this type is the Harvard sensitive price index 
based upon quotations on 13 basic commodities (raw ma- 
terials). The purposes of such an index are served by the 
selection of a limited number of commodities the prices of 
which are subject to extreme fluctuations, rather than by
the inclusion of a great many commodities. Yet the uses 
to which an index of this type may be put are limited. 
The “sluggishness” of the many-commodities index number 
is a sluggishness which inheres in the price system, and 
which must be reflected in a faithful index of general 
prices. 

The question of the number of commodities to be included 
cannot be discussed apart from that of the character of 
these commodities. The representative character of an index 
number rests in part upon the number of price series in- 
cluded, but the nature of these series is of even greater 
importance. For there are highly significant differences in 
the behavior of the prices of different commodity groups. 
These groups of prices, their interrelations, their behavior, 
their relation to the functioning of the economic system 
and to the swings of prosperity and depression, are matters 
of immediate and practical importance to economists and 
business men.

PRICE GROUPS IN THE FIELD OF WHOLESALE PRICES

Since an index number of wholesale prices must rest upon 
sample quotations, the sample must be representative, must 
include commodities whose prices are typical of the various 
elements in the price system. The division into elements 
for this purpose must be based upon the character of the 
price changes peculiar to the different groups. Of the 
groups thus distinguished, the most obvious are those rep- 
resenting different industries. Textile prices and steel prices, 
leather prices and the prices of chemicals are subject to 
different influences. Trade depressions and revivals do not 
affect all industries at the same time or in the same way, 
so that an index of wholesale prices must include quotations 
from all important industrial groups. If preponderant in- 
fluence upon an index is exerted by the prices of certain 
types of commodities, the index, by that much, loses its 
representative character. Thus Bradstreet’s index, it has 
been established, gave greater weight to cotton fabrics, 
hides and leather, and cured meats than was justified by their 
actual importance in trade, a fact which did not detract 
from its utility for some purposes but which lessened its 
value as a representative index of wholesale prices. 

The extent of these differences between the price move- 
ments of commodities in different industrial groups may be 
appreciated by comparison of the index numbers of whole- 
sale prices of grains and metals and metal products during 
the business recession that began in the summer of 1937. 

In order that an index may be representative it is not alone 
sufficient that all industries be given an appropriate number 
of representatives in the sample. Raw materials and man- 
ufactured goods show characteristic differences in their fluc- 
tuations, and fitting representation must be given to each 
of these groups. Prices of the former are, in general, more 
sensitive to changes in business conditions, their movements 
preceding those of manufactured goods and showing more
violent fluctuations. There are several reasons for this.
Raw materials are traded in for purposes of manufacture 
and sale. When business improves after a period of depres- 
sion, increased demand on the part of consumers (or expected 
increase in demand) leads competing manufacturers to bid 
against each other for raw materials. It is in the raw ma- 
terial markets that the pressure of increased demand first 
centers, and this bidding generally causes prices to rise in 
these markets before the prices of other goods are affected. 
Similarly, at the first evidence of slackening trade manu- 
facturers’ demand for raw materials falls off. Business 
forces pure and simple play in the raw material markets 
with more freedom than in the markets for manufactured 
goods. Hence the tendency of prices in these markets to 
anticipate, in their movements, prices in other commodity 
markets. 

Additional reasons for the greater stability of prices of 
manufactured goods are found in the fact that these prices 
include a greater percentage of stable cost factors, and in 
the control over supply exercised by most manufacturers. 
Wages, interests, rents move more slowly and less violently 
than do commodity prices. The inclusion of these elements 
in commodity prices tends to render these prices more stable. 
Therefore, as commodities move forward from the raw stage 
to their final manufactured condition their prices include 
more and more of these stabilizing elements, and become 
less violent in their fluctuations.¹ Control over supply,
which manufacturers possess in much higher degree than 
primary producers, makes possible the enforcement of defi- 
nite price policies by fabricators. Under these conditions, 
stable prices and variable output are usually found. 

¹ Cf. Mitchell, “The Making and Using of Index Numbers,” Bulletin 284,
U. S. Bureau of Labor Statistics, 44-45, for examples.

Each of the groups last mentioned contains minor groups
of commodities with distinct price characteristics. Within
the raw material group there are marked differences between
agricultural products, animal products, forest products, and
mineral products. Agricultural products are affected by
weather and crop conditions as well as by business conditions
and, though subject to price fluctuations of some magnitude,
reflect prevailing business conditions less accurately than do
the prices of mineral products.² Animal and forest products
appear to stand between these two with respect to the
faithfulness with which they reflect business conditions in
their price movements. Thus, in selecting raw materials
for inclusion in a sample of price quotations from which a
representative index number is to be constructed, fair weight
must be given to these various classes.³

Manufactured goods, again, do not constitute a single 
homogeneous group with respect to their price movements. 
In so far as they are to be used for further production, or to 
undergo further manufacture, they resemble raw materials
in relation to the bidding of competing manufacturers, and 
their prices, therefore, are characterized by relatively wide 
oscillations. In so far as the demand for them is for the pur- 
pose of final consumption, purely business forces have less 
weight, and their prices are more stable. Related to this 
argument is that which has already been presented, the 
increasing stability of prices as the stable elements of wages 
and overhead charges bulk larger in commodity costs. So, 
again, the sample price quotations from which an index of
wholesale prices is to be constructed must include prices 
representative of producers’ and consumers’ goods, of goods 
in the intermediate as well as the final stages of manufacture. 

² It should not be inferred from this that there is no relation between agri-
cultural production and the prices of agricultural products, and general business
conditions. The immediate price relation is frequently one of contradictory
movements, and cycles in agricultural production are not synchronous with
business cycles. But conditions in these two fields of economic activity are
mutually related in many ways.

³ Cf. “The Making and Using of Index Numbers,” 47.

Other important divisions of the price system exist. The
behavior of the prices of capital equipment differs from that
of prices of goods intended for human consumption. The
prices of durable goods differ in their fluctuations from the
prices of perishable goods. Goods imported into a given
country and goods exported from that country are usually
subject to the play of different forces. A representative
index number of wholesale prices should be based upon price
quotations drawn from all commodity groups marked by
distinctive modes of behavior, with weight given to each in
proportion to the relative importance in trade of the com-
modities in that category.

Price Comparisons over Time 

In the opening pages of this chapter the fact was noted
that the degree of dispersion found in frequency distributions
of price relatives depended upon the length of time covered
in price comparisons. Hence, on statistical grounds, there is
justification for the conclusion that the accuracy of well-
constructed price indices is high for measurements extending
over a short interval, and becomes progressively lower as
the range of the time comparison increases. This conclusion 
is supported by other considerations. 

In Laspeyres’ formula,

$I = \frac{\sum p_1 q_0}{\sum p_0 q_0}$

the price factor alone varies, as between numerator and de-
nominator. The constant weighting factor, $q_0$, is assumed to
define quantities entering into trade in an unchanging system 
of income distribution, living standards, consumption habits, 
etc. This system, for which Sir George Knibbs has used 
the term “regimen,” is taken to be common to the two 
periods compared. If it is constant, and if the q’s which
define its quantitative attributes are unchanged, then we may 
expect to measure with accuracy the one factor which does 
change — commodity prices. The condition we have here as- 
sumed is the orthodox one of ceteris paribus, the condition that 
factors other than the one subject to study remain constant. 

In fact, of course, the regimen does not remain fixed.
Changes in tastes and in consumption habits occur; changes
in types of goods used as capital equipment take place;
incomes shift, and the flow of goods is altered by changes in
the distribution of buying power among consuming groups;
the very price changes that we seek to measure bring altera- 
tions in the demand for given types of goods, and in the quan- 
tities produced. Of no small moment in the total situation 
are the changes that occur in the quality of goods that con- 
tinue to pass by the same trade names. The automobile of 
1938 is the same commodity, by name, as the automobile 
of 1910, but to the average consumer the later model repre- 
sents quite a different bundle of utilities. Similarly, steel, 
textiles, locomotives, even the staple articles of diet have 
undergone important quality changes. A comparison of 
price levels in 1910 and 1938 that depends for its accuracy 
on the assumption that all elements of economic life except 
prices have remained constant is suspect, indeed. 

Our difficulties are not removed if we take as the standard 
of reference the regimen of the second of the two periods 
compared. This is done in Paasche’s formula, 

$I = \frac{\sum p_1 q_1}{\sum p_0 q_1}$

The system of consumption standards and all that goes 
with it may be of modern vintage in this case, but the 
difference between the regimens of the two periods com-
pared is just as wide. We have not held constant non-price
factors, and our measurement of price changes loses in 
accuracy, as a result. 

The method exemplified by the Ideal formula, that of 
employing weighting factors drawn from both periods, rep-
resents one attempt at the solution of this problem, but it
is far from perfect. The use of quantities drawn from the
two regimens does not create a common regimen, the indis-
pensable condition of full accuracy in such comparisons.

The practical procedure in the face of this difficulty is to
restrict our comparisons, if high accuracy is required, to
periods not widely separated in time. Consumption habits,
living standards and technical production methods will be
not widely dissimilar in two such periods, and hence the
number of identical commodities common to the two periods
will be large. Under these conditions considerable confi-
dence may be placed in index numbers measuring average
price changes. Comparison of price levels over longer periods
may be desired, and may be justified, but the margin of error
in the measurements may be expected to increase as the time
span extends. Formal precision in weighting and in the selec-
tion of acceptable formulas will not provide an escape from
the unavoidable difficulties arising out of alterations in the
basic conditions of economic life. Real continuity of in-
dices covering a stretch of years is possible only on the
basis of a persisting common regimen.

These considerations support the claims of an index of
the chain type, which involves the measurement of price
changes between successive periods not far apart in time.
Bruce D. Mudgett has advocated this procedure. The com-
parison of price levels in two periods, close together in time
and with similar regimens, will be accurate, if such an index
as the Ideal be employed. The elements of such a chain
may then be linked together, in attempting to measure
price changes between non-consecutive periods. If the regi-
mens of the non-consecutive periods differ materially, the
accuracy of the comparison will probably not be high. But
it is reasonable to believe that better results will be secured
by bridging the intervening years in the manner proposed
than by constructing a single far-flung index based only
upon the widely dissimilar regimens of two periods far
removed in time.

The Wholesale Price Index of the United States
Bureau of Labor Statistics

The authoritative index of wholesale prices in the United
States is that compiled by the United States Bureau of Labor
Statistics. This index was first constructed in 1902, for the
period beginning with 1890. It was continued until 1913 
as an unweighted average of relative prices, the base of 
each relative being the average price of the given commodity 
during the ten-year period 1890-1899. Various revisions of 
procedure have been made since 1913. As it stands at pres- 
ent the index for any given period (week, month, or year) 
is a weighted aggregate of actual prices, the aggregate being 
expressed, to facilitate comparison, as a relative with 1926 
as the base. 

The index now includes 813 price series. (A single com- 
modity may be represented by several quotations, the prices 
for different grades or in different markets being given. 
Thus for raw cotton there are three quotations, Middling, 
New Orleans; Middling Upland, New York; and Middling 
Upland, Galveston.) In the derivation of the aggregate for 
any date each price quotation is multiplied by a given weight, 
known as a “quantity weight” or a “multiplier.” This 
same weight is applied to the price quotation for the base 
period. The cross products thus obtained for the base period 
and the date in question are values of a stated quantity of 
goods; they differ only in respect of the price factor. The 
following tabulation illustrates the method as applied to 
cotton: 



Commodity                             Average price,     Quantity weight    Average price,
                                      November, 1937     (pounds)           November, 1937
                                      (per pound)                           × quantity weight
                                      $p_k$              $q_h$              $p_k q_h$

Cotton, Middling, New Orleans         $.080              1,399,496,000      $111,959,680
Cotton, Middling Upland, N. Y.         .080                 77,750,000         6,220,000
Cotton, Middling Upland, Galveston     .077              6,297,729,000       484,925,133


When this process is carried out for the entire 813 price 
series included, the sum of the values in the last column 
gives the index number for the given period, in this case
November, 1937. As published, this sum is expressed as a
relative, the aggregate in 1926 representing 100.¹ The
formula for the index measuring the level of wholesale prices
at time “1,” with reference to the base level at time “0” is,
thus,

$\frac{P_1}{P_0} = \frac{\sum p_1 q_h}{\sum p_0 q_h}$

where $q_h$ represents the constant multipliers. The method
of construction renders it possible to shift the base to any
desired year or month, changing the given relatives to per-
centages on the new base.
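
The arithmetic of the cotton tabulation, and the shifting of the base, can be sketched as follows. The three cotton quotations and multipliers are those given above; prices are held in mills (tenths of a cent) so that the products are exact whole numbers. The relatives used to illustrate the base shift are hypothetical round figures, not published values of the index, and the function names are chosen for this example.

```python
# Cotton quotations from the tabulation: price in mills per pound
# (80 mills = $.080) and quantity weight in pounds.
cotton = {
    "Middling, New Orleans": (80, 1_399_496_000),
    "Middling Upland, N. Y.": (80, 77_750_000),
    "Middling Upland, Galveston": (77, 6_297_729_000),
}

def dollar_value(price_mills, pounds):
    """Price x quantity weight, converted from mills to whole dollars."""
    return price_mills * pounds // 1000

values = {name: dollar_value(price, pounds)
          for name, (price, pounds) in cotton.items()}
cotton_aggregate = sum(values.values())

for name, value in values.items():
    print(f"{name:30s} ${value:,}")
print(f"{'Cotton aggregate':30s} ${cotton_aggregate:,}")

def shift_base(relatives, new_base):
    """Re-express relatives as percentages of the value in the new base period."""
    return {period: 100 * r / relatives[new_base]
            for period, r in relatives.items()}

# Hypothetical relatives on a 1926 base, used only to show the base shift.
on_1926_base = {1926: 100.0, 1936: 80.0, 1937: 86.0}
on_1936_base = shift_base(on_1926_base, 1936)
print(on_1936_base)
```

The three products agree with the last column of the tabulation. Because the multipliers are the same in every period, dividing one relative by another gives the same figure as dividing the corresponding aggregates, which is why the base can be shifted without returning to the original data.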

This index number, therefore, is based upon the cost at
wholesale of a bill of goods. The bill of goods remaining 
the same, the total cost changes as the prices of the various 
commodities change, and the index measures the effect of 
these changing individual prices upon the total cost. 

It is essential, of course, that the quantity used as mul- 
tiplier for each series of price quotations truly represent 
the relative importance of the commodity in question. The 
multipliers employed are approximations to the quantities 
actually marketed. Changes are made from time to time in 
these quantities, the revisions being applied, of course, to 
the base period aggregates as well as to the figures for later 
periods. In addition, when it is necessary to substitute one 
price series for a related one that has been discontinued or 
has lost significance, minor modifications are made in the 
multipliers so as to maintain comparability between the 
aggregates for periods preceding and periods following the 
date of substitution.²

¹ This base is, at the date of writing, twelve years removed in time. Adop-
tion of a 1935-1937 base period is now being considered.

² The method of adjustment is explained in an article, “Revised Method of
Calculation of the B. L. S. Wholesale Price Index,” by Jesse M. Cutts and
J. Dennis, Journal of the American Statistical Association, December,
1937.

The Bureau of Labor Statistics publishes index numbers
of wholesale prices for 10 major and 45 minor commodity
groups, as well as a general index for all commodities. The
major groups include farm products, foods, hides and leather 
products, textile products, fuel and lighting, metals and metal 
products, building materials, chemicals and drugs, house fur- 
nishing goods, and miscellaneous conamodities. The con- 
stituent elements of the index are also classified into raw 
materials, semi-manufactured articles and finished products, 
and measurements of price changes for these groups are 
computed. The National Bureau of Economic Research has 
constructed index numbers for various other categories of 
commodities, utilizing the quotations of the Bureau of Labor 
Statistics. These classes include raw and processed goods, 
durable and non-durable goods, producers’ goods and con- 
sumers’ goods, goods destined for use in capital equipment 
and goods destined for human consumption, foods and non- 
foods, and crops, animal products, minerals, and forest 
products.^ The availability of index numbers for various 
significant classes of goods makes it possible to trace price 
changes with more precision, and to interpret them more 
accurately, than when dependence is placed upon a single 
all-embracing index. For the elements of the price system 
are marked by wide diversity in their behavior over both 
long and short periods of time. 

Other Price Index Numbers 

The measurement of price changes by the use of index 
numbers has not been confined to wholesale prices. Many 
variations of this device have been utilized in measuring 
price movements in other fields. It will be useful at this 
point briefly to indicate the character of some of these 
variations.^ 

^ See Prices in Recession and Recovery, N. Y., National Bureau of Economic
Research, 1936, 492-540.

^ Detailed information concerning the character and content of a wide variety 
of index numbers, price and other, will be found in An Index to Business Indices,
Donald H. Davenport and Frances V. Scott, Chicago, Business Publications,

INDEX NUMBERS OF RETAIL PRICES

An index of retail food prices is published currently by 
the United States Bureau of Labor Statistics. The general 
methods employed are similar to those already explained in 
connection with the index of wholesale prices computed by 
that agency, with such differences as inevitably result from 
the nature of the material. 

Actual retail selling prices of 84 articles of food are secured 
biweekly from dealers in 51 representative cities throughout 
the United States. In weighting the quotations on foods 
of a single type (fresh vegetables, for example) in a given 
city, account is taken of the quantities of such foods con- 
sumed by an average wage-earner’s family in that city or, 
for some regions, in the district in which that city lies. In 
obtaining weights, consumption by food groups is considered,
rather than by specific commodities, since the commodities
actually priced must be taken to represent similar
commodities for which no prices are collected.

The combination for a single city (or geographical area) 
of food prices thus weighted yields an index for that region. 
The food cost index for the United States is computed from 
the aggregates for the 51 cities, each weighted according to 
the population of the area which the city is taken to repre- 
sent. Thus the weights entering the final index of retail 
food prices for the country as a whole represent quantities 
consumed by the average wage-earner’s family, and the 
population assumed to be affected by each series of quoted 
prices. The base of the index numbers, as published, is the 
average of the three-year period 1923-1925. 
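The two stages of weighting described above may be sketched as follows. The two cities, their populations, and their food costs are hypothetical, and the precise manner in which the Bureau combines the city aggregates is an assumption made here for illustration only.

```python
# Hypothetical figures for two cities (the index described covers 51).
# "cost" is the cost of the weighted food budget of a wage-earner's family.
cities = {
    "City A": {"population": 3.0, "base_cost": 20.0, "current_cost": 18.0},
    "City B": {"population": 1.0, "base_cost": 16.0, "current_cost": 16.8},
}

# Index for each city: current cost as a percentage of base-period cost.
city_index = {
    name: 100.0 * c["current_cost"] / c["base_cost"]
    for name, c in cities.items()
}

# Index for the country: city aggregates weighted by population (assumed form).
current_total = sum(c["population"] * c["current_cost"] for c in cities.values())
base_total = sum(c["population"] * c["base_cost"] for c in cities.values())
national_index = 100.0 * current_total / base_total
```

The national figure falls between the two city figures and lies nearer that of the more populous city, which is the effect the population weights are meant to secure.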

The indices of retail food prices, together with index 
numbers of the prices of electricity and coal, at retail, are 
published in the Monthly Labor Review. 

The difficulties inherent in the problem of measuring 
wholesale price movements have been discussed at some 
length. The construction of index numbers of retail prices 
of the type just described presents even greater difficulties.
All the theoretical problems arising in the former case are
to be solved and, in addition, the practical difficulties of 
securing suitable weights, accurate price figures, and com- 
parable quotations are intensified. Because of the lack of com- 
modity standardization, and because of variations in business 
practice and local customs, the latter difficulty is particularly 
acute. For these reasons no index of retail prices at present 
published can be accepted with the confidence with which 
the best indices of wholesale prices may be received. 

INDEX NUMBERS OF THE COST OF LIVING

If these problems are acute in constructing an index of 
retail prices they are doubly hard to solve in measuring 
such an entity as the cost of living. When food prices, 
rents, retail clothing prices, cost of fuel and light, retail 
furniture prices, and the prices of the other miscellaneous 
items which are included in the budget of the average family 
are to be averaged, and an index number constructed to 
measure variations in the cost of these items, numerous 
statistical difficulties must be overcome. Theoretical ques- 
tions concerning the most suitable methods of averaging 
and weighting present themselves, but more important are 
the practical problems involved in the collection of accurate 
and comprehensive prices and weighting data. 

Two index numbers of the cost of living are currently 
compiled in the United States, one by the Bureau of Labor
Statistics, one by the National Industrial Conference Board 
of New York. The former appears in the Monthly Labor 
Review, the latter in periodic publications of the Conference 
Board. In each case the chief items of domestic expenditure 
are weighted in accordance with their relative importance in 
household budgets, and the combined results expressed as 
relative numbers. These are given on the 1913 and 1923-1925 
base by the Bureau of Labor Statistics, on the 1923 base 
by the Conference Board. ^ 

^ For a general discussion of the problem, with details of the Conference

INDEX NUMBERS OF PRICE AND BUYING POWER OF
FARM PRODUCTS

A set of useful index numbers relating to the prices re- 
ceived by and the prices paid by farmers is compiled by the 
United States Department of Agriculture. The first of these
is based upon the prices at the farm, as of the middle of 
each month, of 34 major farm products and 13 commercial 
truck crops. The weights employed are the average quan- 
tities marketed by farmers during the period 1924-1929. 
Farmers and agricultural economists have need of such a 
specialized index, because the wholesale prices of farm prod- 
ucts in the great exchanges or in large cities are often poor 
representatives of the prices actually received by farmers. 

The index of prices paid by farmers is compiled quarterly 
(in March, June, September, and December). The constitu- 
ent quotations are retail prices paid by farmers for commod- 
ities used in family maintenance and in production. Weights 
are estimated quantities bought by farmers. The base of
the index of farm prices, as published, is the average of the 
five pre-war years from August, 1909 to July, 1914; that of 
the index of prices paid, 1910-1914. Measurements for sub- 
groups are given, in both cases. 

These two index numbers are used in the derivation of an 
index of the purchasing power of farm products. The com- 
putation of the purchasing power index may be illustrated 
with reference to the figures for 1936. In that year the index 
of prices of farm products was 114. The index of prices paid 
by farmers was 124. That is, the farmer was receiving 
14 per cent more, on the average, for a unit of product than 
in 1909-1914, but the average price paid by him for a unit of 
goods purchased was 24 per cent higher than in the base
period. Therefore the purchasing power of an average unit
of farm products was 8 per cent less than in 1909-1914
(114 ÷ 124 = .92).
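The same division may be written out as a short Python sketch, using the 1936 figures quoted in the text. The function name is an assumption introduced for illustration.

```python
def purchasing_power(prices_received, prices_paid):
    """Index of purchasing power: prices received as a percentage of prices paid."""
    return 100.0 * prices_received / prices_paid


# Figures for 1936 as quoted in the text.
power_1936 = purchasing_power(114, 124)
shortfall = 100.0 - power_1936  # per cent below the base-period level
```

Rounded to the nearest whole number, the result is 92, the 8 per cent shortfall stated above.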

Board procedure, see Cost of Living in the United States, 1914-1936, M. Ada
Beney, New York, National Industrial Conference Board, 1936.

These three index numbers, for selected years, are given
in Table 57. 

Table 57

Index Numbers of Farm Prices, Prices Paid by Farmers, and the
Buying Power of Farm Products ^

   (1)             (2)               (3)                (4)
                                                  Average per unit
   Year       Prices received    Prices paid      purchasing power
                by farmers       by farmers       of farm products
                                                     (2) ÷ (3)

1910-1914          100*              100                100
1918               202               176                115
1920               211               201                105
1921               125               152                 82
1925               156               157                 99
1929               146               153                 95
1932                65               107                 61
1933                70               109                 64
1934                90               123                 73
1935               108               125                 86
1936               114               124                 92
1937               121               131                 93

* Aug., 1909-July, 1914 = 100.

These are significant measurements, yielding valuable in- 
formation concerning the buying and selling relations that 
vitally affect one important group of producers. The devel- 
opment of similar measurements for other groups will add 
materially to our understanding of the changes that shifting 
market relations entail, in the economy at large. Yet the 
limitations of these index numbers should not be overlooked. 
The measurement of prices paid by farmers and, correspond- 
ingly, the measurement of the purchasing power of farm 
products, are subject to the difficulties referred to in the 
discussion of retail prices and living costs. Under existing 
conditions the margin of error in all such measurements is
fairly wide. The error is the greater, too, the longer the 
time span covered by the quotations. In the present case, 
goods bought by farmers have undergone greater changes 

^ Source: The Agricultural Situation, U. S. Bureau of Agricultural Economics,

in quality than have the fairly standardized staples that
the farmer sells. Here, as in all price comparisons over time, 
greater confidence must attach to short-period comparisons 
than to those spanning several decades.

CHAPTER VII


THE ANALYSIS OF TIME SERIES: MEASUREMENT 
OF TREND 

The preceding sections have dealt primarily with frequency 
series and with problems arising in the attempt to organize 
and describe such series. We are now concerned with 
data in the study of which the essential problem is the 
analysis of chronological variations. Such series are of 
major importance in the field of economic statistics, for 
most of the data of economics and business are variables in 
time — as bank clearings, steel production, volume of sales, 
etc. This dominating importance of series in time is not 
found in any other field of statistical research, and the 
development of methods of analysis appropriate to time 
series has come, accordingly, only within recent years with 
the wider adoption of statistical methods in the field of 
economics. 

Problems connected with time series arise both in the 
ordinary routine of internal administration and in the 
analysis of general economic conditions. Sales, purchases,
profits on the one hand, stock prices, interest rates, business 
failures on the other, are variables which fluctuate with the 
passage of time. In the analysis of such series it is generally 
desired that the rate and character of growth be determined, 
and that periodic and accidental fluctuations be isolated 
for study. The sales manager wishes to know how the vol- 
ume of sales is faring, when and why it fluctuates and how 
it compares with volume of production. The economist 
desires to trace the trend of prices, and to scrutinize minutely 
the upward and downward movements of the price level. 
The making of business plans on even a small scale, as
well as the most elaborate schemes of economic forecasting,
must rest upon such study of past trends and fluctuations,
and upon comparison of the movements of related
series in time. Scientific study of the business cycle is only 
possible through the application of such methods. Our
present task is the development of methods appropriate to 
the analysis of series in time. 

The Preliminary Organization of Time Series

The data of time series usually require less preliminary 
organization than do statistical data which are to be reduced 
to the form of a frequency distribution. The source, pri- 
mary or secondary, from which the figures are taken usually 
presents them in shape for analysis. Certain precautions 
should be observed, however. 

The dates to which the figures apply should be clearly 
understood and definitely stated. Monthly data may be 
as of the first of each month (as for the stock price index
of the New York Stock Exchange), averages for each month
(as for the Bureau of Labor Statistics’ price index), or 
totals for each month (as in the case of figures on cotton 
consumption). They may be cumulative monthly figures, 
each item representing the total for the year to date, as 
in the case of certain coal production data. If average 
figures are given for a month or year it is important to 
know how the average has been secured. 
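Where figures are cumulative, the total for each month is recovered by subtracting from each item the item preceding it. A minimal sketch follows; the year-to-date figures are hypothetical and the function name is an assumption.

```python
def monthly_from_cumulative(cumulative):
    """Convert year-to-date totals into totals for the separate months."""
    monthly = []
    previous = 0.0
    for total in cumulative:
        monthly.append(total - previous)
        previous = total
    return monthly


# Hypothetical year-to-date output, January through April.
year_to_date = [40.0, 75.0, 120.0, 150.0]
monthly = monthly_from_cumulative(year_to_date)
```

The sketch assumes that the accumulation begins anew with the first item. A series running over several years would have to be divided by year before the subtraction is made.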

Again, it is essential that in any time series there be
strict comparability between data for different periods.
Any attempt to analyze a series that is not homogeneous
must be misleading and futile. Yet such series are not
infrequently published. Commodity production or consumption
figures published by trade associations and by
governmental agencies are sometimes based upon returns
from a varying number of reporting concerns. A series
of price quotations may lack comparability as between
different dates because of changes in the unit or grade to
which the quotations apply, or because quotations are
drawn from different markets. Changes in census classifications
may result in lack of comparability of census data.
A change in a salesman’s territory may alter his returns 
materially. It is stated that the character of the obligations 
represented by the United States Steel Corporation’s figures 
for “unfilled orders” has varied from time to time. Records 
relating to the physical output of a given commodity in 
different periods may be rendered inaccurate by changes 
in quality or design. These are examples of faults that 
may be found in time series, rendering analysis futile. 
Strict testing is essential before a series be accepted as 
accurate and homogeneous. 

GRAPHIC REPRESENTATION OF TIME SERIES

Normally the first step to be taken in visualizing a series 
in time and in preparing for further analysis consists of 
plotting the data. The trend and general characteristics 
of a series may be most readily apprehended through 
graphic representation. The data may be plotted on ordinary 
arithmetic or on semi-logarithmic paper. The advantages 
of the latter type for certain purposes have already been
explained. The choice in a given case will depend upon the 
nature of the data and the object of the study. If interest 
lies in the absolute amount of fluctuations in sales, prices, 
pig iron production or whatever may be in process of 
analysis, or in the comparison of absolute differences between 
series, the ordinary rectilinear chart is to be employed. 
If percentage variations and the comparison of relative 
fluctuations are matters of interest, the semi-logarithmic 
representation is preferable. In general, if one is accustomed 
to the interpretation of this latter type of chart, its use is 
advisable. A clearer, less-distorted presentation of relations 
and a more significant comparison of series are generally 
secured when economic data having time as one variable 
are plotted on paper with a logarithmic ruling on one axis.

For some purposes the process of studying series in time
will have been completed when the data are thus plotted. 
The general trend may be roughly determined from the 
chart. The existence of seasonal and other periodic varia- 
tions may be ascertained. Rough comparisons of trends 
and fluctuations may be made. All the knowledge thus 
secured, it should be noted, will be non-quantitative in 
character, and the comparisons will be tentative and 
approximate. Even so, such charts enable trends and 
relations to be much more clearly visualized than do the 
raw figures, and for some purposes the knowledge thus
secured is sufficient, though it lacks precision and accuracy. 
For other purposes more exact measurement and more 
refined analysis are required. Certain appropriate methods 
may be described. 
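The distinction drawn above between the arithmetic and the semi-logarithmic chart may be illustrated numerically. In the sketch below a hypothetical series doubles in each period; the figures are assumptions chosen for the purpose.

```python
import math

# Hypothetical series that doubles in each period.
series = [100.0, 200.0, 400.0, 800.0]

# Vertical steps on an arithmetic scale: absolute differences.
arithmetic_steps = [b - a for a, b in zip(series, series[1:])]

# Vertical steps on a logarithmic scale: differences of logarithms.
log_steps = [math.log10(b) - math.log10(a) for a, b in zip(series, series[1:])]
```

The arithmetic steps grow from 100 to 400, while the logarithmic steps are all equal to the logarithm of 2. Equal percentage changes thus appear as equal vertical distances on the ratio ruling.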

Forces Affecting Series in Time

The general object in the analysis of a time series is the 
isolation of the effects of one or more of the forces affecting 
the given series. This may be desired in order that the 
past behavior of the single series may be understood, in 
order that the future behavior of the series may be pre- 
dicted, or in order that two or more series may be compared. 
It is not in any case possible to isolate these effects of
individual forces with absolute accuracy, and in some cases
it is impossible even to approximate such a result. But
given figures covering a sufficiently long period, the effects
of various influences upon the behavior of a given series
may usually be measured with some degree of accuracy.

What are the forces that affect series of data in time?
The forces in any given case may be unique, affecting only
the given series, but in general the various influences acting
upon such series may be placed in a limited number of
categories.

SECULAR TREND

In the first place, most series of economic statistics exhibit 
definite trends. Such a trend may be constant in direction, 
may change direction at a constant rate, or may even be 
characterized by abrupt shifts in direction or rate that 
reflect the introduction of novel elements. Thus the volume 
of production or sales of a business house over a period 
of years usually shows a fairly regular growth. The same 
is true of population, the production of basic minerals, 
the number of motor vehicles registered, etc. In some 
cases the rate of growth may be a negative one, as is true 
of interest rates in the United States over the last half 
century. The concept of secular trend (i.e., trend over a 
long period of time) covers both positive and negative 
changes of this type. 

In the analysis of a time series the trend value at any 
date is taken to be the “normal” value at that date. This 
conception of a normal value which may be used as a base 
or point of reference in judging the effects of all forces other 
than the growth factor, is fundamental in economic analy- 
sis. “No other method,” says Carl Snyder, “enables us so 
quickly to set economic events in their just perspective.” We 
should note, however, that such a normal value is essen- 
tially an empirical construction. While useful for purposes 
of reference, and as one of a series of measurements reflecting
secular movements in a given series, it should not be
assumed to possess any special normative significance. 

The fact should be emphasized that by secular trend is 
meant the smooth, regular, long-term movement of a statis- 
tical series. Frequent and sudden changes either in absolute 
amounts or in rates of increase or decrease are quite incon- 
sistent with the idea of secular trend. It is true that there 
may be occasional changes due to the interjection of a 
new element or the withdrawal of an old factor. But the 
breaking up into numerous sections of the period covered
by a time series, and the determination of trend for each
of these minor periods, does violence to the very concept 
of secular change. 

It does not follow from this discussion that a definite 
rising or declining trend exists for all time series. Many 
series, such as barometric readings at a certain point, 
merely fluctuate about a constant level that does not change 
with the passage of time. 

PERIODIC FLUCTUATIONS 

If the plotted representation of a time series be studied, 
the long-term trend may be discerned in the general upward 
or downward drift, but may not be precisely determined 
by inspection because of the existence of numerous fluctua- 
tions, superimposed upon the trend. These fluctuations 
may be regular or irregular, violent or mild, simple or 
complex. The value of the variable at any given date 
may be thought of as the net resultant of the interaction 
of the secular trend and the various forces that tend to
modify the persistent secular movements of a given series.
(It may be, in fact, that for many series the trend is the
resultant of the interplay of a variety of conflicting forces,
rather than an underlying movement upon which the periodic
and other fluctuations are superimposed. In the present
discussion no attempt is made to define the organic relations
that may exist among the forces affecting a series in time.)
These latter forces may be of several types. 

Seasonal variations are found in most series of economic
statistics for which quarterly, monthly, or weekly values 
are obtainable. Consumption and production of commodi- 
ties, interest rates, bank clearings, railroad freight traffic, 
and many other types of data are marked by seasonal 
swings repeated with minor variations year after year. 
These, in so far as they exist at all, are definitely periodic 
in character, with a constant twelve-month period. Less
markedly periodic, but nevertheless characterized by a
considerable degree of regularity, are the cyclical fluctuations
that are found in series affected by forces connected with 
economic or business cycles. Prices, wages, the volume 
of industrial production, trading on the Stock Exchange, 
and most series relating to the activities of individual 
business units are affected by the swings of business through 
alternating periods of depression and prosperity. While 
the length of such periods may vary, the general sequence 
of change has been in the past sufficiently regular to render 
these cyclical movements capable of study. 


RANDOM FLUCTUATIONS

Entangled with these more or less regular movements 
are the effects of random, accidental, and irregular fluctua- 
tions — catastrophic events such as the San Francisco earth- 
quake, wars, floods, fires, and countless minor events equally 
fortuitous though less violent in the resulting disruptions. 
These events influence the value of a variable at a given 
date, modifying the effects of long-term movements and of 
seasonal and cyclical factors. The observed value at any 
time is the resultant of the play of all these forces. 

The analysis of series in time involves the isolation of 
the effects of these various forces, so far as this is possible. 
A problem may call for the study of but one factor, or it 
may require the complete breaking up of given values. 
When annual data are used the seasonal element will not 
enter, of course. The explanation of methods begins with 
a consideration of problems involving only this type of data. 

The Measurement of Secular Trend

As an example of the type of material in connection with
which these problems arise, the figures in Table 58 may be
taken. The values are given in thousands of millions in
order to simplify the calculations.

As has been pointed out, the figure for any year, as the
value of $162.7 thousands of millions for 1934, is the net
resultant of the many forces that we have classified under
headings of secular trend, cyclical variations, and random
or accidental fluctuations. Our first problem is to
measure the secular trend.

In Fig. 54 the data of New York bank clearings during
the period 1875-1936, inclusive, have been plotted. A
definite trend is apparent, together with well marked and
more or less regular deviations from that trend. Several
methods are available for arriving at approximations to
this trend. By employing moving averages an attempt may
be made to eliminate passing fluctuations and to arrive
at values that define the influence of the steadily operating
growth factor. If we assume that a definite functional
relationship prevails (empirically at least) between the time
factor and the other variable, an approximation to the
trend may be secured by fitting an appropriate curve to
the data. Smoothing the data by hand gives some-

Table 58

New York Clearing House Transactions, 1875-1936
(In thousands of millions)

Year  Clearings    Year  Clearings    Year  Clearings    Year  Clearings

1875     25.1      1891     34.1      1907     95.3      1923    214.6
1876     21.6      1892     36.3      1908     73.6      1924    235.5
1877     23.3      1893     34.4      1909     99.3      1925    276.9
1878     22.5      1894     24.2      1910    102.6      1926    293.4
1879     25.2      1895     28.3      1911     92.4      1927    307.2
1880     37.2      1896     29.4      1912     96.7      1928    368.9
1881     48.6      1897     31.3      1913     98.1      1929    456.9
1882     46.6      1898     39.9      1914     89.8      1930    399.5
1883     40.3      1899     57.4      1915     90.8      1931    287.7
1884     34.1      1900     52.0      1916    147.2      1932    177.3
1885     25.3      1901     77.0      1917    181.5      1933    154.6
1886     33.4      1902     74.8      1918    174.5      1934    162.7
1887     34.9      1903     70.8      1919    214.7      1935    174.4
1888     30.9      1904     59.7      1920    252.3      1936    186.5
1889     34.8      1905     91.9      1921    204.1
1890     37.7      1906    103.8      1922    213.3

[Fig. 54. Chart not reproduced. Legend: original data; three-year moving average.]

mative and empirical in character. In certain studies it
has been found possible to use one statistical series as base 
or trend line for another series of homogeneous data. 


Moving Averages 

When a trend is to be determined by the method of 
moving averages, the average value for a number of years 
(or months, or weeks) is secured, and this average is taken 
as the normal or trend value for the unit of time falling 


Table 59

New York Clearing House Transactions, 1912-1936, and 3-, 5-, 7-,
and 9-Year Moving Averages
(In thousands of millions)

        Original   Three-year   Five-year   Seven-year   Nine-year
Year      data     moving av.   moving av.  moving av.   moving av.

1912    $ 96.7
1913      98.1     $ 94.87
1914      89.8       92.90      $104.52
1915      90.8      109.27       121.48      $125.51
1916     147.2      139.83       136.76       142.37      $149.51
1917     181.5      167.73       161.74       164.40       161.44
1918     174.5      190.23       194.04       180.73       174.24
1919     214.7      213.83       205.42       198.23       188.11
1920     252.3      223.70       211.78       207.86       204.19
1921     204.1      223.23       219.80       215.57       218.60
1922     213.3      210.67       223.96       230.20       231.03
1923     214.6      221.13       228.88       241.44       245.78
1924     235.5      242.33       246.74       249.29       262.91
1925     276.9      268.60       265.52       272.83       285.64
1926     293.4      292.50       296.38       307.63       307.36
1927     307.2      323.17       340.66       334.04       315.62
1928     368.9      377.67       365.18       341.50       311.48
1929     456.9      408.43       364.04       327.27       302.49
1930     399.5      381.37       338.06       307.44       289.80
1931     287.7      288.17       295.20       286.80       276.58
1932     177.3      206.53       236.36       259.01       263.17
1933     154.6      164.87       191.34       220.39
1934     162.7      163.90       171.10
1935     174.4      174.53
1936     186.5

at the middle of the period covered in the calculation of
the average. Table 59 shows the results secured when three-, 
five-, seven-, and nine-year moving averages are thus 
computed for New York clearings for the period 1912-1936. 

The three-year moving average for 1916 is the average 
of the figures for 1915-16-17, the five-year figure for 1916 
is the average of the years 1914-15-16-17-18. The other 
averages are computed in the same way. In each case the
average is centered for the period included; that is, the
average is taken to represent normal as of the middle
of the given period. The employment of an odd number
of years simplifies this centering process, though it is not
essential that the number be odd. With an even number
of years, the figure may be centered by taking a two-year 
moving average of the first moving average. The three- 
and nine-year moving averages for the entire period are 
plotted, with the original data, in Fig. 54. 
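The computation and centering of moving averages may be sketched in Python with the original data for 1912-1936. The function names are assumptions introduced for illustration, and the sketch is one straightforward way of carrying out the arithmetic, not the only one.

```python
years = list(range(1912, 1937))
clearings = [96.7, 98.1, 89.8, 90.8, 147.2, 181.5, 174.5, 214.7, 252.3,
             204.1, 213.3, 214.6, 235.5, 276.9, 293.4, 307.2, 368.9, 456.9,
             399.5, 287.7, 177.3, 154.6, 162.7, 174.4, 186.5]


def moving_average(values, period):
    """Simple moving averages; item i covers values[i : i + period]."""
    return [sum(values[i:i + period]) / period
            for i in range(len(values) - period + 1)]


def centered_moving_average(values, period):
    """Moving averages placed at the middle of the span each one covers.

    The result has the same length as `values`, with None where no
    centered average exists. For an even period, a two-item moving
    average of the first moving average does the centering.
    """
    averages = moving_average(values, period)
    if period % 2 == 0:
        averages = moving_average(averages, 2)
    offset = period // 2
    result = [None] * len(values)
    for i, average in enumerate(averages):
        result[i + offset] = average
    return result


trend3 = dict(zip(years, centered_moving_average(clearings, 3)))
trend5 = dict(zip(years, centered_moving_average(clearings, 5)))
trend9 = dict(zip(years, centered_moving_average(clearings, 9)))
```

Rounded to two decimals, `trend3[1913]` is 94.87, `trend5[1914]` is 104.52, and `trend9[1916]` is 149.51, in agreement with the first entries of the corresponding columns. No centered value exists for the years at either end of the period.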

It is obvious that the effect of the averaging is to give a 
smoother curve, lessening the influence of the fluctuations 
that pull the annual figures away from the general trend. 
The longer the period included in securing each average, 
the smoother is the curve secured, though there are other 
factors to consider in deciding upon the length of the period. 
Certain of these factors may be noted. 

CHARACTERISTICS OF MOVING AVERAGES

Given cyclical fluctuations about a uniform level or about 
a line ascending with a uniform slope, the length of the 
cycle and the magnitude of the fluctuations being constant, 
a moving average having a period equal to the period of 
the cycle (or to a multiple of that period) will give a straight 
line, a perfect representation of the trend. Under the 
same conditions a moving average having a period greater 
or less than the period of the cycle will give, not a straight 
line, but a new cycle having the same period as the original, 
but with fluctuations of less magnitude. The minima and
maxima of the cycles thus obtained will not necessarily
coincide with the minima and maxima of the original cycles. 
In general, when such a new cycle is obtained the magnitude 
of the fluctuations will be less the longer the period on 
which the average is based. ^ 

These propositions may be illustrated by the figures in 
Table 60, arbitrarily chosen. In the first example five 
figures have been selected which repeat themselves in 
sequence, fluctuating about a common level. 

The moving averages in columns (2) and (3) represent 

Table 60

Illustrating the Application of Moving Averages

   (1)           (2)              (3)              (4)              (5)
Cyclical   Moving average   Moving average   Moving average   Moving average
  data       of 5 items       of 10 items      of 3 items       of 8 items
                              (centered)                        (centered)

(The items in columns (3) and (5) have been centered by means of a
moving average of 2 items.)

* The decrease in the magnitude of the fluctuations is not regular, however,
but cyclical.

the data with the cycles completely removed. When the
period of the average is not equal to the period of the cycle,
or to a multiple of that period, the cycle is not removed,
as is apparent from the figures in columns (4) and (5).
The conclusions suggested above hold when the cyclical 
fluctuations take place about any straight line. In Table 61 
the foregoing data have been employed but with a constant 
increment of 3. This is equivalent to superimposing the 
same cycles upon a line with a slope of + 3. 

Table 61

Illustrating the Application of Moving Averages to a Series with
Linear Trend

   (1)           (2)              (3)              (4)              (5)
Cyclical   Moving average   Moving average   Moving average   Moving average
  data       of 5 items       of 10 items      of 3 items       of 8 items
                              (centered)                        (centered)

    2
    9                                            8 1/3
   14         12 1/5                            14
   19         15 1/5                            16 2/3
   17         18 1/5                            17 2/3           18 3/8
   17         21 1/5           21 1/5           19 1/3           21 13/16
   24         24 1/5           24 1/5           23 1/3           24 3/8
   29         27 1/5           27 1/5           29               26 3/4
   34         30 1/5           30 1/5           31 2/3           29 11/16
   32         33 1/5           33 1/5           32 2/3           33 3/8
   32         36 1/5           36 1/5           34 1/3           36 13/16
   39         39 1/5           39 1/5           38 1/3           39 3/8
   44         42 1/5           42 1/5           44               41 3/4
   49         45 1/5           45 1/5           46 2/3           44 11/16
   47         48 1/5           48 1/5           47 2/3           48 3/8
   47         51 1/5                            49 1/3           51 13/16
   54         54 1/5                            53 1/3
   59         57 1/5                            59
   64                                           61 2/3
   62

(The items in columns (3) and (5) have been centered by means of a
moving average of 2 items.)


The trend values, with the effect of the cycles completely 
removed, are secured by taking moving averages equal in
period to the cycle or to a multiple of that period. The
cycle persists, with the same period but with diminished 
amplitude, when the average is based upon a period not 
equal to that of the cycle, as is clear from the figures in 
columns (4) and (5). 
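The behavior shown in Table 61 can be checked with a short computation. The following is a minimal sketch, not the author's worksheet: it assumes the five repeating cyclical figures are 2, 6, 8, 10, 5 (the values listed as cyclical data in Table 62) added to a line rising by 3 per item, and the function names are illustrative only.

```python
from fractions import Fraction

def moving_average(values, k):
    """Simple k-item moving average, one value per complete window."""
    return [Fraction(sum(values[i:i + k]), k)
            for i in range(len(values) - k + 1)]

def centered_moving_average(values, k):
    """Even-period average, centered by a further moving average of 2 items."""
    return moving_average(moving_average(values, k), 2)

cycle = [2, 6, 8, 10, 5]                       # assumed repeating cycle
series = [cycle[i % 5] + 3 * i for i in range(20)]

ma5 = moving_average(series, 5)                # period equal to the cycle
ma10 = centered_moving_average(series, 10)     # a multiple of the cycle
ma3 = moving_average(series, 3)                # shorter than the cycle
ma8 = centered_moving_average(series, 8)       # not a multiple of the cycle

def steps(values):
    """Successive differences; a constant step means a straight line."""
    return [b - a for a, b in zip(values, values[1:])]

print([float(v) for v in ma5[:4]])
print(sorted(set(steps(ma5))), sorted(set(steps(ma10))))
print(sorted(set(steps(ma3))))
```

Under these assumptions the 5-item and 10-item averages advance by exactly 3 at every step, that is, they trace the straight line alone, while the 3-item and 8-item averages still show unequal steps, so a reduced cycle persists in them.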

When these ideally simple conditions of constant period 
and amplitude do not exist, the moving average becomes 
more ambiguous and its interpretation less simple. If the 
period of the cycle varies, the selection of a period for the 
moving average is more difficult. In general, a period 
equal to or greater than the average length of the cycle 
is to be selected. An average having a shorter period will 
give a line that is marked by pronounced cycles, these 
cycles being reduced as the period covered in the calculation 
of the average increases. 

When the amplitude of the cycle varies, the period being
constant, a moving average with a period equal to the 
length of the cycle will give a line of trend marked by 
minor cycles. The amplitude of these secondary cycles 
will be a minimum when the period of the average is equal 
to the period of the cycle (or to a multiple of that period). 
When these last two irregularities are combined, and the 
data are characterized by cycles of varying amplitude and 
of varying length, the moving average giving the most 
effective representation of the trend is that which has a 
period equal to the average length of the cycle, or to a 
multiple of that length.

A new factor enters when the trend departs from line- 
arity. If the underlying trend of a series is concave upward, 
a moving average will always exceed the actual trend value; 
if the reverse is true, and the trend is convex upward, a 
moving average will always be less than the actual trend 
value. 

These conditions are depicted in the following examples. 
The figures in Table 62 give the values secured when a 
cycle of constant period and amplitude, as in col. (3), is
superimposed upon a line of trend that is concave upward,
i.e., increasing at a constantly increasing rate. If the moving
average could completely eliminate the effects of the
cycle, the values secured from the average would be equal to
the average value of the five items in each cycle (6.2) plus
the values of the function y = x², given in col. (2).

Table 62

Illustrating the Application of Moving Averages to a
Non-Linear Series
(Increasing rate)

 (1)     (2)      (3)          (4)             (5)               (6)
  x      x²     Cyclical   Col. (2) plus   Moving average   True trend values
                 data        col. (3)        of 5 items        (x² + 6.2)

  0       0        2            2
  1       1        6            7
  2       4        8           12             12.2              10.2
  3       9       10           19             17.2              15.2
  4      16        5           21             24.2              22.2
  5      25        2           27             33.2              31.2
  6      36        6           42             44.2              42.2
  7      49        8           57             57.2              55.2
  8      64       10           74             72.2              70.2
  9      81        5           86             89.2              87.2
 10     100        2          102            108.2             106.2
 11     121        6          127            129.2             127.2
 12     144        8          152            152.2             150.2
 13     169       10          179            177.2             175.2
 14     196        5          201            204.2             202.2
 15     225        2          227            233.2             231.2
 16     256        6          262            264.2             262.2
 17     289        8          297            297.2             295.2
 18     324       10          334
 19     361        5          366

The values of the moving average are, in this case, in
excess of the true trend values, a form of distortion that
will always occur with a series of this type.

In Table 63 are shown the results of superimposing the 
same cyclical values upon a line of trend that is convex 
upward, i.e., increasing at a constantly decreasing rate.
In this case, a perfect method of eliminating the cycles
would give results equal to the average value of the five
items (6.2) plus the values of the function y = √x.

In this case the moving average values are consistently 
too low. The discrepancy is greatest for the lower values 
of X, as the decrease in the rate of growth is most marked 
for these values. 
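The direction of the distortion in the two non-linear cases may be verified numerically. The sketch below is only an illustration under stated assumptions: it takes the cyclical figures 2, 6, 8, 10, 5 and the trend functions y = x² and y = √x used in Tables 62 and 63, and the helper names are hypothetical.

```python
import math

def moving_average(values, k):
    """Simple k-item moving average, one value per complete window."""
    return [sum(values[i:i + k]) / k for i in range(len(values) - k + 1)]

cycle = [2, 6, 8, 10, 5]
xs = list(range(20))
cycle_mean = sum(cycle) / len(cycle)           # 6.2

def distortion(trend):
    """Moving average of 5 items minus the true trend value (trend + 6.2)."""
    series = [trend(x) + cycle[x % 5] for x in xs]
    averages = moving_average(series, 5)       # centered on x = 2 ... 17
    true_values = [trend(x) + cycle_mean for x in xs[2:-2]]
    return [m - t for m, t in zip(averages, true_values)]

concave_up = distortion(lambda x: x ** 2)      # increasing rate of growth
convex_up = distortion(math.sqrt)              # decreasing rate of growth

print([round(d, 3) for d in concave_up[:3]])
print([round(d, 3) for d in convex_up[:3]], round(convex_up[-1], 3))
```

With the parabola the average runs above the true trend by a constant amount; with the square-root trend every difference is negative, and the shortfall is largest at the lowest values of x, in line with the statement above.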

Table 63

Illustrating the Application of Moving Averages to a
Non-Linear Series
(Decreasing rate)

 (1)     (2)      (3)          (4)             (5)               (6)
  x      √x     Cyclical   Col. (2) plus   Moving average   True trend values
                 data        col. (3)        of 5 items        (√x + 6.2)

  0     0          2           2.00
  1     1.00       6           7.00
  2     1.41       8           9.41           7.428              7.61
  3     1.73      10          11.73           7.876              7.93
  4     2.00       5           7.00           8.166              8.20
  5     2.24       2           4.24           8.414              8.44
  6     2.45       6           8.45           8.634              8.65
  7     2.65       8          10.65           8.834              8.85
  8     2.83      10          12.83           9.018              9.03
  9     3.00       5           8.00           9.192              9.20
 10     3.16       2           5.16           9.354              9.36
 11     3.32       6           9.32           9.510              9.52
 12     3.46       8          11.46           9.658              9.66
 13     3.61      10          13.61           9.800              9.81
 14     3.74       5           8.74           9.936              9.94
 15     3.87       2           5.87          10.068             10.07
 16     4.00       6          10.00          10.194             10.20
 17     4.12       8          12.12          10.318             10.32
 18     4.24      10          14.24
 19     4.36       5           9.36




Considerations previously reviewed have indicated that 
a moving average should, in general, be based upon a 
period at least equal to the period of the cycle, and preferably
equal to some higher multiple of that period when the
data are at all irregular. The longer the period covered,
the greater the stability of the average. But when the
underlying trend departs materially from the linear form,
following a curve bending upward or downward, the error
involved in the use of any moving average increases as 
the period of the average increases. If a moving average 
is used in such a case to measure the trend, the period of 
the average should be the shortest which will serve to 
average out the cycles; equal, that is, to the average length 
of one cycle. 

In practice, however, these various conditions are found 
in complicated combinations. The fact that cycles vary in 
amplitude and length calls for a moving average based 
upon a fairly long period. The fact that the trend of the 
data is usually non-linear calls for a short period average 
to lessen the upward or downward distortion. A considera- 
tion of some importance in practical work is that a moving 
average can never be brought up to date. The lag is less, 
of course, the shorter the period covered by the average. 
The selection of a period in a given case must rest upon a 
study of the actual data with these various considerations 
in mind. 

It has been assumed in the preceding discussion that the 
purpose of the moving average is the representation of 
secular trend. The moving average may be used, also, in 
smoothing data for the purpose of eliminating random 
fluctuations. For this purpose a moving average based 
upon a period shorter than the average length of the cycle 
should be selected. 

We may return now to the problem relating to New York 
bank clearings. A study of the lines marked out by the 
different moving averages in Fig. 54 reveals significant 
differences between them. The three-year average follows 
the graph of the original data most closely, as would be 
expected. The nine-year average marks out the smoothest 
line of trend, but, on the other hand, departs most widely
from the data. This is particularly noticeable from 1893 to
1898, from 1911 to 1915, from 1921 to 1926, and from 1927 
to 1931. It is due to the pronounced changes in the rate 
of growth of the series during these periods. Except for 
these distortions the general trend seems to be most accu- 
rately represented by the nine-year average. 

In determining the relative merits of the different moving 
averages we are aided by a knowledge of the course of 
business during the period covered. The volume of New 
York bank clearings is a sensitive index of general business 
conditions, responding immediately to changes in specula- 
tive and industrial activity. Major and minor business 
cycles are reflected in this series. Knowing the number of
cycles through which business has passed during the period 
1875-1936, we may determine which of the moving averages 
serves best as a standard from which to measure cyclical 
deviations. In this case we are practically working back- 
ward from a known result, a method not always available. 

If we take as a starting point in each cycle the year in 
which revival began, after recession, the following cycles 
in general business activity may be distinguished:* 

1871-1879 1908-1912 

1879-1885 1912-1914 

1885-1888 1915-1919 

1888-1891 1919-1921 

1891-1894 1921-1924 

1894-1897 1924-1927 

1897-1900 1927-1933 

1901-1904 1933- 

1904-1908 

* These dates are based upon the chronology of American business cycles
developed by Wesley C. Mitchell; cf. “Production during the American
Business Cycle of 1927-1933,” by Wesley C. Mitchell and Arthur F. Burns,
Bulletin 61, National Bureau of Economic Research, November 9, 1936. It should
be noted that the chronology is based upon monthly data, whereas the Clearing
House data cited in the text are annual figures.

The cycles marked out by the three-year moving average
are too numerous to enumerate. In fact, the deviations
from this average are primarily accidental and minor
fluctuations and should not be classed as cycles. Deviations
from the five-, seven-, and nine-year averages mark out the
following cycles:
Cycles of deviations    Cycles of deviations    Cycles of deviations
   from five-year          from seven-year         from nine-year
  moving averages          moving averages         moving averages

     1879-1885               1879-1885               1879-1885
     1885-1888               1885-1888               1885-1888
     1888-1891               1888-1894               1888-1897
     1891-1897               1894-1900               1897-1900
     1897-1900               1900-1904               1900-1904
     1900-1904               1904-1908               1904-1908
     1904-1908               1908-1911               1908-1915
     1908-1911               1911-1915               1915-1923
     1911-1915               1915-1918               1923-
     1915-1918               1918-1923
     1918-1924               1923-1927
     1924-1927               1927-1932
     1927-1932               1932-
     1932-




Some of the differences between the series of cycles thus 
determined and the reference cycles distinguished by 
Mitchell are doubtless due to the distinctive behavior of 
New York clearings. Other differences reflect the peculiari- 
ties of moving averages. Deviations from the five-year 
averages between 1879 and 1927 show one more cycle than 
we find in the series based on seven-year averages, four 
more cycles than are shown by the nine-year averages. 
And yet the deviations from five-year averages fail to show 
the cycles of 1894-1897 and of 1921-1924. The nine-year 
averages reveal only eight cycles between 1879 and 1927, 
as against Mitchell’s fourteen reference cycles. Mitchell 
was working, of course, with monthly data which are 
more sensitive than annual data to cyclical forces. More- 
over, he was dealing with relatively short movements, some 
of which appear as only minor fluctuations in general business 
activity. 

If interest attaches to the shorter swings of business, 
to cycles with average durations of three or four years,
a moving average of relatively short period should be used.
A five-year average is appropriate. Averages of longer 
period define trend movements more faithfully, but may 
fail to reveal fluctuations properly classified as business 
cycles. We should refer, however, to recent attempts to 
establish the reality of long cycles, of nine, eleven, or as 
many as thirty years in average duration. In the study 
of such cycles moving averages of corresponding periods 
would be employed. 

In general, the moving average has the prime advantage 
of flexibility. The representation of secular trend by mathe- 
matical curves frequently involves the breaking up of a 
period into two or three subdivisions, and the fitting of 
separate curves to each. This results from changing condi- 
tions and sharply changing rates of growth or decline. 
Where such changes occur the moving average has the
merit of flexible adaptation to the new conditions and is 
often a more effective measure of secular trend than curves 
fitted with great labor. 

Simple and weighted moving averages, in varying com- 
binations, have wide uses in the analysis of economic time 
series. An illuminating discussion of these uses, and of the 
procedures appropriate to different purposes, is to be found 
in The Smoothing of Time Series, by Frederick R. Macaulay.*

Representation of Secular Trend by Mathematical Curves

For many types of data the secular trend may be repre- 
sented by a mathematical curve rather than by a line 
based upon a moving average. Thus, if the growth (or 
decline) is by constant absolute increments (or decrements) 
a straight line will serve as an exact representation of the 
trend. Or the growth may be by constant percentages, 
as in the case of capital increase, when a principal sum 
increases in accordance with the compound interest law. 

* National Bureau of Economic Research, New York, 1931.

A curve of a definite mathematical form furnishes the best
representation of this trend. In many series of economic
statistics the data seem to conform to definite laws of
growth or decline, and where this is the case the task of
analysis, interpretation, and projection is materially assisted
by securing a mathematical expression for the underlying
law. In practically all cases, of course, there are departures
from this law, deviations above and below the line of secular
trend. These deviations, however, do not destroy the value
of an equation that describes the underlying law of development.

There is one fundamental difference between the moving
average as a measure of trend and such mathematical
curves. The former implies no definite “law” to which
the data are assumed to conform. It is based upon the
data as given; if the general trend changes, the moving
average follows the new trend. It is a flexible measure of
trend, adapting itself to changing conditions, purporting
to be nothing more than an empirical approximation to
the drift of the series. Mathematical curves fitted to economic
series are, in fact, nothing more than empirical

would not be homogeneous, and a single equation for the
trend during the whole period should not be secured.

In the practical approach to a problem involving the 
determination of secular trend the first task is the selection 
of the appropriate type of curve. This is perhaps the most 
difficult part of the work; certainly it is the part in which 
the element of personal judgment enters most directly. 
For there is no objective rule to follow, no fixed standard 
by which the most appropriate curve may be selected. 
Something more will be said on this subject after the 
characteristics of the chief types of curves and the methods 
of fitting them have been described. For the present it 
may be assumed that a curve similar to one of the types 
described in Chapter II, or to a related form, has been 
selected, and that we face the practical task of fitting it to 
the data. 

FITTING A STRAIGHT LINE; THE METHOD OF LEAST SQUARES

If the data, when plotted, show a trend that can best 
be represented by a straight line the task of fitting is 
merely the determination of the constants in an equation 
of the form y = a + bx. The values of a and b which 
will give a line following most closely the trend of the 
data are to be obtained. A simple illustration may serve 
to demonstrate the various methods which may be employed. 
Nine points (1, 3; 2, 4; 3, 6; 4, 5; 5, 10; 6, 9; 7, 10; 8, 12; 
9, 11) are plotted in Fig. 55. Our problem is the fitting of 
a straight line to these points. 

By inspection approximate values of a and b may be 
determined. A thread may be stretched through the 
points in such a direction that it seems to follow the trend 
as closely as possible. The slope of the line thus laid out 
may be measured, the y-intercept determined, and the
desired equation thus approximated. Obviously this is a 
loose and uncertain method, and the results obtained by 
different individuals may be expected to vary rather widely.
There is one and only one straight line that fits the plotted
data most accurately. The constants for this line of best
fit may be determined by the method of least squares. 

Fig. 55. Illustrating the Fitting of a Straight Line to Nine Points

The theory upon which the method of least squares is
based need not be detailed at length here. The argument
may be briefly presented: A number of observation values
of a certain quantity are found, and it is desired to obtain
the most probable value of the quantity which is being
measured. It is capable of demonstration that the most
probable value of the quantity is that value for which the
sum of the squares of the residuals is a minimum. (The
“residual” is a term for the difference between a given
estimated value and an observation value.) This is true
of the arithmetic mean of the observation values. Thus,
if a given distance be measured by a number of individuals,
with varying results, the most probable value is the arithmetic
mean of the different measurements. The process
of computing the mean involves the following steps, which
are enumerated for the purpose of simplifying the later
explanation. We seek a result, a statement of the most
probable value of the distance being measured, which will
take the form:

M = (a constant).

Let us say we have three approximations to this value:

M = 5,672 feet
M = 5,671 feet
M = 5,676 feet
adding, 3M = 17,019 feet.

Since there is but one unknown, M, it may be derived
directly from this equation, and we have

M = 5,673 feet.

This is the value for which the sum of the squares of the 
deviations is a minimum. 
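A brief numerical check of this property, using the three measurements quoted above, is sketched here. The function name is illustrative, and the comparison with neighboring whole-foot values is merely one convenient way of exhibiting the minimum.

```python
measurements = [5672, 5671, 5676]              # feet, as in the example

mean = sum(measurements) / len(measurements)   # 17,019 / 3

def sum_of_squares(m):
    """Sum of the squared deviations of the measurements from m."""
    return sum((x - m) ** 2 for x in measurements)

for candidate in (5671, 5672, 5673, 5674, 5675):
    print(candidate, sum_of_squares(candidate))
```

Among the candidates tried, the sum of squares is smallest at the mean and rises on either side of it.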

A similar problem arises when the relation between two 
variables is being measured. Our goal in this case is the 
equation that correctly describes this relationship. We 
have secured, however, varying results which do not agree 
precisely as to the constants in the equation of relationship. 
In other words, our plotted points do not all lie on the 
same line. What are the most probable values of the con- 
stants in the required equation? The answer is analogous 
to that given when a single quantity was being measured. 
We seek the constants which, when the resulting equation 
is plotted, will give a line from which the deviations of 
the separate points, when squared and totaled, will be a 
minimum. Assuming that each pair of measurements gives 
an approximation to the true relationship between the 
variables, we wish to find the most probable relationship, 
and this is given by the line for which the sum of the squared 
deviations is a minimum.*

* Cf. Appendix A for a more detailed discussion of the method of least squares,
together with a description of certain checks upon the calculations.

We have, in the present example, nine pairs of values for
x and y. Substituting these values in the generalized form
of the linear equation, y = a + bx, we secure the following
observation equations:

 3 = a + 1b
 4 = a + 2b
 6 = a + 3b
 5 = a + 4b
10 = a + 5b
 9 = a + 6b
10 = a + 7b
12 = a + 8b
11 = a + 9b.

Any two of these equations could be solved as simultaneous 
equations, and values of a and b secured. But these values
would not satisfy the remaining equations. Our problem 
is to combine the nine observation equations so as to secure 
two normal equations, which, when solved simultaneously, 
will give the most probable values of a and b. The first 
of these normal equations is secured by multiplying each of 
the observation equations by the coefficient of the first 
unknown (a) in that equation, and adding the equations 
obtained in this way. Since the coefficient of a in the present 
case is 1 throughout, the nine observation equations are 
unchanged by the process of multiplication. The second 
of the normal equations is secured by multiplying each 
of the observation equations by the coefficient of the second 
unknown (b) in that equation, and adding the equations
obtained. Thus the first equation is multiplied throughout 
by 1, the second by 2, and so on. The process of securing 
the two normal equations is illustrated in Table 64 on 
page 250. 

Table 64

Derivation of Normal Equations from Observation Equations

 Observation equations        Observation equations
 multiplied by the            multiplied by the
 coefficient of a             coefficient of b

    3 = a + 1b                   3 = 1a +  1b
    4 = a + 2b                   8 = 2a +  4b
    6 = a + 3b                  18 = 3a +  9b
    5 = a + 4b                  20 = 4a + 16b
   10 = a + 5b                  50 = 5a + 25b
    9 = a + 6b                  54 = 6a + 36b
   10 = a + 7b                  70 = 7a + 49b
   12 = a + 8b                  96 = 8a + 64b
   11 = a + 9b                  99 = 9a + 81b

   70 = 9a + 45b               418 = 45a + 285b

The two normal equations are

 70 =  9a +  45b
418 = 45a + 285b.

It remains to solve these equations for a and b. By multiplying
the first equation by 5 and subtracting it from the
second, a may be eliminated; a value of 68/60, or 1.133, is
found for b. Substituting this value in either of the equations,
a value of 2.111 is secured for a. The equation to
the best fitting straight line is, therefore,

y = 2.111 + 1.133x.

In the actual application of the method it is not necessary 
to write out and total the equations, as is done above. 
We need only insert the proper values in the two equations,* 

Σ(y) = na + bΣ(x)

Σ(xy) = aΣ(x) + bΣ(x²).

The symbols employed have the following meanings:

Σ(y): the sum of the values of y.

Σ(x): the sum of the values of x.

Σ(xy): the sum of the products of the paired x’s and y’s.

Σ(x²): the sum of the squares of the values of x.

n: the number of pairs of values; the number of points
plotted.

The work of computation is facilitated by a tabular 
arrangement similar to that shown in Table 65. 

The two desired normal equations are secured by sub- 
stituting these five values in the type equations given 
above. It will be noted that the results are identical with 
those obtained from the observation equations. 
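The tabular computation may also be carried out mechanically. The following sketch, offered only as an illustration, forms the five required values for the nine points of the example and solves the two normal equations by elimination; the variable names are assumptions of the sketch, and exact fractions are used so that no rounding enters.

```python
from fractions import Fraction

points = [(1, 3), (2, 4), (3, 6), (4, 5), (5, 10),
          (6, 9), (7, 10), (8, 12), (9, 11)]

n = len(points)
sum_x = sum(x for x, _ in points)
sum_y = sum(y for _, y in points)
sum_xy = sum(x * y for x, y in points)
sum_x2 = sum(x * x for x, _ in points)

# Normal equations:
#   sum_y  = n * a     + b * sum_x
#   sum_xy = a * sum_x + b * sum_x2
# Eliminating a between them gives b; a then follows from the first equation.
b = Fraction(n * sum_xy - sum_x * sum_y, n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

print(n, sum_x, sum_y, sum_xy, sum_x2)
print(a, b, round(float(a), 3), round(float(b), 3))
```

The exact constants found in this way are a = 19/9 and b = 17/15, which round to the 2.111 and 1.133 given in the text.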

* General rules for the formation of normal equations are given in
Appendix A.

Table 65

Computation of Values Required in Fitting a Straight Line

   x      y      xy      x²

   1      3       3       1        n = 9
   2      4       8       4        Σ(x) = 45
   3      6      18       9        Σ(y) = 70
   4      5      20      16        Σ(x²) = 285
   5     10      50      25        Σ(xy) = 418
   6      9      54      36
   7     10      70      49
   8     12      96      64
   9     11      99      81

  45     70     418     285



When the equation to the best fitting straight line has 
been obtained the values of y corresponding to given values 
of X may be computed and compared with the observed 
values. Table 66 presents the results secured : 

Table 66

Comparison of Observed and Computed Values of a Variable Quantity*

  x       y            y             d             d²            xd
      (observed)   (computed)

  1       3        3.24 4/9     -  .24 4/9       .0597      -   .24 4/9
  2       4        4.37 7/9     -  .37 7/9       .1427      -   .75 5/9
  3       6        5.51 1/9     +  .48 8/9       .2390      +  1.46 6/9
  4       5        6.64 4/9     - 1.64 4/9      2.7041      -  6.57 7/9
  5      10        7.77 7/9     + 2.22 2/9      4.9381      + 11.11 1/9
  6       9        8.91 1/9     +  .08 8/9       .0079      +   .53 3/9
  7      10       10.04 4/9     -  .04 4/9       .0020      -   .31 1/9
  8      12       11.17 7/9     +  .82 2/9       .6760      +  6.57 7/9
  9      11       12.31 1/9     - 1.31 1/9      1.7190      - 11.80

                                   0.0         10.4885          0.0


The sum of the deviations of the plotted points from the
line is zero. The sum of the deviations when each is multiplied
by the corresponding value of x is also zero. The
accuracy of the actual calculations involved in fitting may
be tested in this way. The sum of the squares of the deviations,
10.4885, is a minimum. Any change in the value of
a or b would give a line from which the sum of the squared
deviations would exceed 10.4885.

* The common fractions are retained in certain columns in order that the sum
of the deviations may be exactly zero.
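These checks lend themselves to a direct verification. The sketch below is an illustration only: it assumes the exact constants a = 19/9 and b = 17/15 (the unrounded solution of the two normal equations, of which 2.111 and 1.133 are the rounded forms) and tests a handful of neighboring lines; the names and the size of the trial step are arbitrary choices.

```python
from fractions import Fraction

points = [(1, 3), (2, 4), (3, 6), (4, 5), (5, 10),
          (6, 9), (7, 10), (8, 12), (9, 11)]

# Exact solution of 70 = 9a + 45b and 418 = 45a + 285b (assumed here).
a, b = Fraction(19, 9), Fraction(17, 15)

def squared_error(a_, b_):
    """Sum of squared deviations of the points from the line y = a_ + b_ x."""
    return sum((y - (a_ + b_ * x)) ** 2 for x, y in points)

deviations = [y - (a + b * x) for x, y in points]
sum_d = sum(deviations)
sum_xd = sum(x * d for (x, _), d in zip(points, deviations))
best = squared_error(a, b)

# Shift a and b slightly in every direction and recompute the sum of squares.
step = Fraction(1, 100)
neighbours = [squared_error(a + da, b + db)
              for da in (-step, 0, step)
              for db in (-step, 0, step)
              if (da, db) != (0, 0)]

print(sum_d, sum_xd, float(best), float(min(neighbours)))
```

With exact fractions both sums of deviations vanish identically and the sum of squares is 472/45, or about 10.489, which agrees with the tabulated total to within the rounding of the individual squares. Each of the eight neighboring lines gives a larger sum.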


FITTING A STRAIGHT LINE; SPECIAL CASES

The simultaneous solution of the two normal equations 
will give, in any case, the most probable values of a and b. 
The processes of calculation may be simplified in certain 
special cases, not infrequently encountered in handling 
economic data. If the x’s are consecutive numbers, as they 
always are when an unbroken time series is plotted, the 
origin may be taken at the median value. When the number 
of observations is odd this will be the middle item, of 
course. The value of Σ(x) will then be zero, and the normal
equations become

Σ(y) = na

Σ(xy) = bΣ(x²).

Thus if a time series extends, by years, from 1901 to 1937, 
the origin may be taken at 1919, the value of x corresponding 
to 1918 being — 1, to 1920, + 1, and so on. The solution 
for values of a and b is rendered much easier when the 
data may be disposed in this way. When there is an even 
number of years the same process is possible, time (the 
x-variable) being measured in units of one half year. 
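The saving in labor may be seen by applying the simplified equations to the nine points fitted earlier. This is a sketch under the assumption that the x-values run consecutively from 1 to 9, so that the middle item (x = 5) serves as origin; the names are illustrative.

```python
from fractions import Fraction

ys = [3, 4, 6, 5, 10, 9, 10, 12, 11]           # observations for x = 1 ... 9
n = len(ys)

mid = (n - 1) // 2                             # position of the middle item (n odd)
us = [i - mid for i in range(n)]               # -4, -3, ..., +4; their sum is zero

# With the origin at the middle item the normal equations separate:
#   sum(y)  = n * a_centered
#   sum(uy) = b * sum(u^2)
a_centered = Fraction(sum(ys), n)
b = Fraction(sum(u * y for u, y in zip(us, ys)), sum(u * u for u in us))

# Returning to the original origin: x = u + 5, so a = a_centered - 5b.
a = a_centered - 5 * b

print(sum(us), a_centered, b, a)
```

Each constant is obtained from a single division, and on shifting back to the original origin the line agrees with the one found from the full normal equations.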

Again, when the values of x are consecutive positive
numbers starting at 1, the values of Σ(x) and of Σ(x²)
may be easily determined. The sum of the first n natural
numbers is equal to n(n + 1)/2. Thus the sum of the numbers
from 1 to 5 is 5(5 + 1)/2, or 15. This term may replace Σ(x)
in the normal equations. Similarly, the sum of the squares
of the first n natural numbers is equal to n(n + 1)(2n + 1)/6.
Thus the sum of the squares of the numbers from 1 to 5
is equal to 5(6)(11)/6 = 55. This expression may replace
Σ(x²) in the normal equations, and we have

Σ(y) = na + b[n(n + 1)/2]

Σ(xy) = a[n(n + 1)/2] + b[n(n + 1)(2n + 1)/6].

It is sometimes easier to work from equations in this form 
than in the form first given. The data for time series may
be handled in this way, the years being numbered consecu- 
tively, beginning with 1. 
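The two summation formulas are readily confirmed against direct addition. In the sketch that follows the function names are illustrative only.

```python
def sum_first_n(n):
    """1 + 2 + ... + n, by the formula n(n + 1)/2."""
    return n * (n + 1) // 2

def sum_squares_first_n(n):
    """1^2 + 2^2 + ... + n^2, by the formula n(n + 1)(2n + 1)/6."""
    return n * (n + 1) * (2 * n + 1) // 6

for n in (5, 9):
    direct = sum(range(1, n + 1))
    direct_squares = sum(k * k for k in range(1, n + 1))
    print(n, sum_first_n(n), direct, sum_squares_first_n(n), direct_squares)
```

For n = 9 the formulas give 45 and 285, the values of the sum of x and the sum of x² entered in the normal equations of the nine-point example.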

FITTING A CURVE OF THE POWER SERIES 

The discussion above has been confined to the case of 
linear trend. Such a type of curve frequently gives an 
excellent fit, but in many cases it fails accurately to fit the 
data. This difficulty is sometimes overcome in practice 
by breaking a series into segments and fitting a separate 
line to the data for each of these periods. Where there is 
an actual break in the series, the period as a whole lacking 
homogeneity, this practice may be justified, but when the 
period is essentially homogeneous the whole concept of 
secular trend is violated by this process of subdividing 
and fitting separate lines. In many cases where a straight 
line will not fit, a curve of the power series may represent 
the trend accurately. The general process of fitting such a 
curve may be briefly described. 

The generalized form of the equation of the type desired 
is y = a + bx + cx² + dx³ + ... . An equation of this
form does not, of course, represent a curve of the parabolic 
type, but in ordinary usage that term is applied to the 
potential series. If carried to the second power of x it is 
called a second degree parabola; if to the third power, a third 
degree parabola, etc. For ordinary purposes such a curve 
should not be carried beyond the second or third power of x. 





If carried to the second power there are three unknowns, and 
three normal equations must be solved simultaneously in 
securing the required values. 

The procedure is similar to that outlined for the linear 
case. Each observation equation is multiplied by the 
coefficient of the first unknown in that equation, and the 
resulting equations are totaled to give the first normal 
equation. The process is repeated for the two other un- 
knowns, and the three normal equations thus obtained are 
solved for a, b, and c. The results are the most probable
values of these three constants. The following are the 
general forms which the three normal equations take; 

Σ(y) = na + bΣ(x) + cΣ(x²).

Σ(xy) = aΣ(x) + bΣ(x²) + cΣ(x³).

Σ(x²y) = aΣ(x²) + bΣ(x³) + cΣ(x⁴).

As an example of the process, the calculations involved 
in fitting a second degree parabola to the points 1, 2; 2, 6; 
3, 7; 4, 8; 5, 10; 6, 11; 7, 11; 8, 10; 9, 9 may be outlined.
It is of the greatest practical importance in curve fitting, 
as in all extensive calculations, that the work be laid out 
and carried on in a definite and systematic fashion, with 
each step definitely related to the preceding and succeeding 
operations. Checks should be introduced wherever possible, 
as mathematical errors creep into even the most careful 
work. A tabular arrangement is generally advisable, each 
operation being revealed and each set of results clearly 
presented. The data in the present problem may be ar- 
ranged as in Table 67. 

When the x’s are consecutive integers beginning with 1,
as in the present case, the values of Σ(x), Σ(x²), Σ(x³),
and Σ(x⁴) may be secured from prepared tables.

Table 67

Computation of Values Required in Fitting a Second Degree Parabola

   x      y      xy      x²     x²y

   1      2       2       1       2        n = 9
   2      6      12       4      24        Σ(x) = 45
   3      7      21       9      63        Σ(x²) = 285
   4      8      32      16     128        Σ(x³) = 2,025
   5     10      50      25     250        Σ(x⁴) = 15,333
   6     11      66      36     396        Σ(y) = 74
   7     11      77      49     539        Σ(xy) = 421
   8     10      80      64     640        Σ(x²y) = 2,771
   9      9      81      81     729

  45     74     421     285   2,771



Substituting these values in the equations given above, 
the following normal equations are secured: 

74 = 9a + 45b + 285c.

421 = 45a + 285b + 2,025c.

2,771 = 285a + 2,025b + 15,333c.

When these equations are solved simultaneously the
following values are secured for the three constants:

a = -.929.
b = +3.523.
c = -.267.

The equation of the desired curve is

y = -.929 + 3.523x - .267x².

This curve and the nine given points are plotted in Fig. 56 
on page 256. 
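The solution of the three normal equations may be reproduced as follows. This is a minimal sketch rather than the procedure of the text: it builds the equations from the nine points and solves them by ordinary Gauss-Jordan elimination in exact fractions, and all names are illustrative.

```python
from fractions import Fraction

points = [(1, 2), (2, 6), (3, 7), (4, 8), (5, 10),
          (6, 11), (7, 11), (8, 10), (9, 9)]

def power_sum(p):
    """Sum of x^p over the points; p = 0 gives n."""
    return sum(Fraction(x) ** p for x, _ in points)

def moment(p):
    """Sum of x^p * y over the points."""
    return sum(Fraction(x) ** p * y for x, y in points)

# Row i holds the coefficients of a, b, c and the left-hand total of the
# i-th normal equation: sum(x^i y) = a sum(x^i) + b sum(x^(i+1)) + c sum(x^(i+2)).
rows = [[power_sum(i + j) for j in range(3)] + [moment(i)] for i in range(3)]
normal_equations = [row[:] for row in rows]    # kept for inspection

# Gauss-Jordan elimination; the pivots are non-zero for distinct x-values.
for col in range(3):
    pivot = rows[col][col]
    rows[col] = [v / pivot for v in rows[col]]
    for r in range(3):
        if r != col:
            factor = rows[r][col]
            rows[r] = [v - factor * w for v, w in zip(rows[r], rows[col])]

a, b, c = (rows[i][3] for i in range(3))
print([round(float(v), 3) for v in (a, b, c)])
```

Rounded to three places, the constants obtained in this way agree with those given in the text.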

If the values of x are consecutive, as in the present 
example, the work of computation is lightened if the mid- 
value is taken as origin. In this case Σ(x) and Σ(x³) are
equal to zero, and the normal equations become

Σ(y) = na + cΣ(x²).

Σ(xy) = bΣ(x²).

Σ(x²y) = aΣ(x²) + cΣ(x⁴).

When a third degree parabola of the form y = a + bx
+ cx² + dx³ is to be fitted to data, four constants must be
determined, and four normal equations are necessary.
These are of the following form:

Σ(y) = na + bΣ(x) + cΣ(x²) + dΣ(x³).

Σ(xy) = aΣ(x) + bΣ(x²) + cΣ(x³) + dΣ(x⁴).

Σ(x²y) = aΣ(x²) + bΣ(x³) + cΣ(x⁴) + dΣ(x⁵).

Σ(x³y) = aΣ(x³) + bΣ(x⁴) + cΣ(x⁵) + dΣ(x⁶).

The solution for four or more constants involves a con- 
siderable amount of arithmetical calculation, and there is 


Fi< 3 . 56. — Illustrating the Fitting of a Second Degree Curve to Nine 

Points 

some question as to the advisability of representing secular 
trend by equations of this type. With a sufficient number 
of constants a curve may be fitted which will follow every 
variation in the data, but such a curve could hardly be taken 
to represent the long time trend.* Minor departures from 

* Hc^arding the employment of potential aeries of the type indicated for 
«I»isenting empirical curves, Steinmetz states that their use is justified: 

1. If the successive coefficients a, b, c . . . decrease in value so rapidly that 
a simple and uniform trend, linear or otherwise, are to be
expected with economic data, but, if a real trend exists,
extreme departures from a fairly simple form are rare. If
such departures are due to pronounced changes in conditions
no single line of trend is likely to be satisfactory, and it is
advisable to break the period into parts, with a separate
line of trend for each part. “Empirical curves,” says
Steinmetz, “can be represented by a single equation only
when the physical conditions remain constant within the
range of the observations.” Though this statement relates
to the fitting of curves to data from the physical sciences,
it applies equally well to economic data.

Determination of the Secular Trend of a
Business Series

FITTING A STRAIGHT LINE

The procedure of fitting certain types of curves to simple 
data has been illustrated in the preceding sections. Before 
proceeding to a discussion of slightly different forms, it 
will be helpful to examine concrete examples of trend 
determination. We first determine the secular trend of 
a series defining the number of concerns in business in the 
United States, during the period from 1899 to 1914.¹ The
observations are given in Table 68, together with the values 
required for the fitting of a straight line to the data, and 
the derived trend values. The values of x represent the 
time factor, while the values of y are the corresponding 
numbers of business concerns. Only the entries in columns 

(Footnote continued from page 256.)

within the range of observation the higher terms become rapidly smaller
and appear as mere secondary terms.

2. If the successive coefficients follow a definite law, indicating a convergent
series which represents some other function, as an exponential, trigonometric, etc.

3. If all the coefficients are very small, with the exception of a few of them, and
only the latter ones thus need to be considered. Cf. C. P. Steinmetz,
Engineering Mathematics, New York, McGraw-Hill, 1917, 214-215.

¹ Data compiled by Dun and Bradstreet.
(2) to (5), it should be noted, are employed in the fitting
process.

Table 68

Number of Concerns in Business in the United States, 1899-1914

Computation of values required in fitting line of trend

Column (3), y: number of concerns in business, in thousands.
Column (6), y_c: trend values (linear) of number of concerns in business, in thousands.

(1)       (2)       (3)        (4)        (5)       (6)
Year       x         y          xy        x^2       y_c

1899       1       1,148       1,148        1      1,152
1900       2       1,174       2,348        4      1,184
1901       3       1,219       3,657        9      1,217
1902       4       1,253       5,012       16      1,250
1903       5       1,281       6,405       25      1,283
1904       6       1,320       7,920       36      1,316
1905       7       1,357       9,499       49      1,349
1906       8       1,393      11,144       64      1,382
1907       9       1,418      12,762       81      1,415
1908      10       1,448      14,480      100      1,448
1909      11       1,486      16,346      121      1,481
1910      12       1,515      18,180      144      1,513
1911      13       1,525      19,825      169      1,546
1912      14       1,564      21,896      196      1,579
1913      15       1,617      24,255      225      1,612
1914      16       1,655      26,480      256      1,645

Totals   136      22,373     201,357    1,496

N = 16
$\Sigma(x)$ = 136
$\Sigma(y)$ = 22,373
$\Sigma(xy)$ = 201,357
$\Sigma(x^2)$ = 1,496


The equations to be solved in determining the required
constants are of the form

$\Sigma(y) = Na + b\Sigma(x)$

$\Sigma(xy) = a\Sigma(x) + b\Sigma(x^2)$.

Inserting the given values in the formulas, we have

22,373 = 16a + 136b
201,357 = 136a + 1,496b

from which

a = 1,118.65
b = 32.90.

The equation to the line of best fit is therefore
y = 1,118.65 + 32.90x
with origin at 1898.

The trend values derived from this equation appear in
column (6) of Table 68. The original data and the line of
trend are plotted in Fig. 57. The fit for the period covered



Fig. 57. — Number of Concerns in Business in the United States, 
1899-1914, with Line of Trend 


is good. The number of concerns in business in the United 
States during the sixteen years before the World War is 
well defined by the straight line we have secured. 
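
The computation just described may be repeated in a few lines of Python. The sketch below assumes the observations of Table 68, with the 1909 figure taken as 1,486 (the value implied by the xy entry and by the column total); the variable names are illustrative only.

```python
# Number of concerns in business (thousands), 1899-1914, from Table 68.
# The 1909 entry is taken as 1,486, the value implied by the xy column.
years = list(range(1899, 1915))
concerns = [1148, 1174, 1219, 1253, 1281, 1320, 1357, 1393,
            1418, 1448, 1486, 1515, 1525, 1564, 1617, 1655]

x = [year - 1898 for year in years]  # origin at 1898, so 1899 -> 1
n = len(x)
sum_x = sum(x)
sum_y = sum(concerns)
sum_xy = sum(xi * yi for xi, yi in zip(x, concerns))
sum_x2 = sum(xi * xi for xi in x)

# Solve the two normal equations
#   S(y)  = N a + b S(x)
#   S(xy) = a S(x) + b S(x^2)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

# Trend values, rounded to the nearest thousand concerns as in column (6).
trend = [round(a + b * xi) for xi in x]
print(f"a = {a:.2f}, b = {b:.2f}")
print(trend[0], trend[-1])
```

The sums computed here should match the totals of Table 68, and the rounded trend values should agree with column (6).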
FITTING A POWER CURVE OF THE SECOND DEGREE

The record of commercial failures in the United States 
over the last forty years provides an example of a series 
following a definitely non-linear trend. Data for the period 
1897-1933 are presented in Table 69, with accompanying 
computations. 

Table 69

Commercial Failures in the United States, 1897-1933 *
Computation of values required in fitting line of trend

(1)       (2)       (3)           (4)            (5)
Year       x         y             xy           x^2 y
                  (No. of
                  failures)

1897     -18      13,351      - 240,318      4,325,724
1898     -17      12,186      - 207,162      3,521,754
1899     -16       9,337      - 149,392      2,390,272
1900     -15      10,774      - 161,610      2,424,150
1901     -14      11,002      - 154,028      2,156,392
1902     -13      11,615      - 150,995      1,962,935
1903     -12      12,069      - 144,828      1,737,936
1904     -11      12,199      - 134,189      1,476,079
1905     -10      11,520      - 115,200      1,152,000
1906      -9      10,682      -  96,138        865,242
1907      -8      11,725      -  93,800        750,400
1908      -7      15,690      - 109,830        768,810
1909      -6      12,924      -  77,544        465,264
1910      -5      12,652      -  63,260        316,300
1911      -4      13,441      -  53,764        215,056
1912      -3      15,452      -  46,356        139,068
1913      -2      16,037      -  32,074         64,148
1914      -1      18,280      -  18,280         18,280
1915       0      22,156              0              0
1916       1      16,993         16,993         16,993
1917       2      13,855         27,710         55,420
1918       3       9,982         29,946         89,838
1919       4       6,451         25,804        103,216
1920       5       8,881         44,405        222,025
1921       6      19,652        117,912        707,472
1922       7      23,676        165,732      1,160,124
1923       8      18,718        149,744      1,197,952
1924       9      20,615        185,535      1,669,815
1925      10      21,214        212,140      2,121,400
1926      11      21,773        239,503      2,634,533
1927      12      23,146        277,752      3,333,024
1928      13      23,842        309,946      4,029,298
1929      14      22,909        320,726      4,490,164
1930      15      26,355        395,325      5,929,875
1931      16      28,285        452,560      7,240,960
1932      17      31,822        540,974      9,196,558
1933      18      19,626        353,268      6,358,824

Totals           610,887    + 1,817,207     75,307,301

* Dun and Bradstreet.

N = 37
$\Sigma(x)$ = 0
$\Sigma(x^2)$ = 4,218
$\Sigma(x^3)$ = 0
$\Sigma(x^4)$ = 864,690
$\Sigma(y)$ = 610,887
$\Sigma(xy)$ = 1,817,207
$\Sigma(x^2 y)$ = 75,307,301

The origin is taken at the middle year to facilitate the
calculations. The values of $\Sigma(x^2)$ and $\Sigma(x^4)$ may be secured
from prepared tables, or from the formulas cited on page 254.

The normal equations to be solved in fitting a second
degree parabola, with the origin at the middle year of the
period covered, are of the form

$\Sigma(y) = Na + c\Sigma(x^2)$

$\Sigma(xy) = b\Sigma(x^2)$

$\Sigma(x^2 y) = a\Sigma(x^2) + c\Sigma(x^4)$.

Inserting the appropriate values, we have

610,887 = 37a + 4,218c
1,817,207 = 4,218b
75,307,301 = 4,218a + 864,690c.

Solving for the constants,

a = 14,827.6
b = 430.82
c = 14.762.
The required equation is

$y = 14,827.6 + 430.822x + 14.762x^2$
with origin at 1915.

The original observations and the line of secular trend
are plotted in Fig. 58. Observations, trend values and
deviations from trend are given in Table 70.
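
With the origin at the middle year the three normal equations separate, b being given by the second equation alone and a and c by the first and third. The following Python sketch carries out the work on the observations of Table 69; the variable names are illustrative, and the figures are entered as read from the table.

```python
# Commercial failures, 1897-1933, from Table 69 (origin at 1915).
failures = [13351, 12186, 9337, 10774, 11002, 11615, 12069, 12199, 11520,
            10682, 11725, 15690, 12924, 12652, 13441, 15452, 16037, 18280,
            22156, 16993, 13855, 9982, 6451, 8881, 19652, 23676, 18718,
            20615, 21214, 21773, 23146, 23842, 22909, 26355, 28285, 31822,
            19626]
x = [year - 1915 for year in range(1897, 1934)]

n = len(x)
sum_y = sum(failures)
sum_xy = sum(xi * yi for xi, yi in zip(x, failures))
sum_x2y = sum(xi * xi * yi for xi, yi in zip(x, failures))
sum_x2 = sum(xi ** 2 for xi in x)
sum_x4 = sum(xi ** 4 for xi in x)

# S(x) and S(x^3) vanish, so b follows from the second equation alone.
b = sum_xy / sum_x2
# The first and third equations are solved together for a and c.
determinant = n * sum_x4 - sum_x2 ** 2
c = (n * sum_x2y - sum_x2 * sum_y) / determinant
a = (sum_y - c * sum_x2) / n

print(f"a = {a:.1f}, b = {b:.2f}, c = {c:.3f}")
```

The constants so obtained may be compared with those of the required equation; trend values for any year follow on substituting the corresponding value of x.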

Commercial failures reflect the major cycles in American 
business, but with movements that reverse those of most 
economic series. Failures are numerous in times of depres- 
sion, fewer in prosperity. The reader who will compare the 
deviations from trend shown in Table 70 with the dates 
of reference cycles given on an earlier page will note the 
general agreement. The sharp fall in business failures 
from 1932 to 1933 reflected, of course, the special conditions 
prevailing in the latter year. 
Table 70

Commercial Failures in the United States, 1897-1933
Actual Values, Trend Values, and Deviations from Trend

Year      Number of       Trend value,       Deviation of
          commercial      second degree      actual from
          failures        parabola           trend value

1897       13,351          11,855.73         +  1,495.27
1898       12,186          11,769.88         +    416.12
1899        9,337          11,713.55         -  2,376.55
1900       10,774          11,686.75         -    912.75
1901       11,002          11,689.47         -    687.47
1902       11,615          11,721.72         -    106.72
1903       12,069          11,783.49         +    285.51
1904       12,199          11,874.78         +    324.22
1905       11,520          11,995.60         -    475.60
1906       10,682          12,145.94         -  1,463.94
1907       11,725          12,325.81         -    600.81
1908       15,690          12,535.20         +  3,154.80
1909       12,924          12,774.11         +    149.89
1910       12,652          13,042.55         -    390.55
1911       13,441          13,340.51         +    100.49
1912       15,452          13,668.00         +  1,784.00
1913       16,037          14,025.01         +  2,011.99
1914       18,280          14,411.54         +  3,868.47
1915       22,156          14,827.60         +  7,328.40
1916       16,993          15,273.18         +  1,719.82
1917       13,855          15,748.29         -  1,893.29
1918        9,982          16,252.92         -  6,270.92
1919        6,451          16,787.07         - 10,336.07
1920        8,881          17,350.75         -  8,469.75
1921       19,652          17,943.95         +  1,708.05
1922       23,676          18,566.68         +  5,109.32
1923       18,718          19,218.93         -    500.93
1924       20,615          19,900.70         +    714.30
1925       21,214          20,612.00         +    602.00
1926       21,773          21,352.82         +    420.18
1927       23,146          22,123.17         +  1,022.83
1928       23,842          22,923.04         +    918.96
1929       22,909          23,752.43         -    843.43
1930       26,355          24,611.35         +  1,743.65
1931       28,285          25,499.79         +  2,785.21
1932       31,822          26,417.76         +  5,404.24
1933       19,626          27,365.25         -  7,739.25
The second degree curve employed to define the trend
of commercial failures does so with reasonable accuracy
over the period here covered. Extrapolation beyond those
limits would be hazardous. Indeed, the changed conditions
under which banking and many other types of business
were conducted after 1933 may well break the continuity
of the series, and generate a new long-term trend.

The Use of Logarithms in Curve Fitting 

The family of curves described above represents a simple 
and very useful type. Perhaps of even greater general 
utility, in the analysis of time series, are curves of a semi- 
logarithmic type. The advantages of plotting many series 
of data on semi-logarithmic or “ratio” paper were explained 
in an earlier section. A fundamental virtue of this type 
of plotting is that it presents a true picture of relative 
variations, of ratios between magnitudes. Relations of 
this type are ordinarily of primary interest in the analysis 
of economic data, and it is logical that determination of 
trends should proceed on the same basis. 

In doing so, we can make use of a group of curves of the
same general form as those already described, the one
difference being that log y takes the place of y throughout.
That is, the straight line form is $\log y = a + bx$, while the
general form for the potential series is
$\log y = a + bx + cx^2 + dx^3 + \cdots$. The curves secured may be constructed on
arithmetic paper, plotting the natural x’s and the logarithms
of the y’s, or natural values of both x’s and y’s may be
plotted on semi-logarithmic paper, the logarithmic scale
extending along the y-axis. The latter is the simpler
method.

To illustrate the procedure, the steps involved in fitting
a curve of the type $\log y = a + bx$ will be shown. The
trend of petroleum production in the United States from
1922 to 1929 is to be determined. The values needed in
the normal equations are derived from Table 71.
Table 71

Petroleum Production in the United States, 1922-1929
(Computation of values required in fitting line of trend)

Year       x         y         log y       x · log y

1922       1       557.5      2.74624       2.74624
1923       2       732.4      2.86475       5.72950
1924       3       713.9      2.85364       8.56092
1925       4       763.7      2.88292      11.53168
1926       5       770.9      2.88700      14.43500
1927       6       901.1      2.95477      17.72862
1928       7       901.5      2.95497      20.68479
1929       8     1,007.3      3.00316      24.02528

Totals                       23.14745     105.44203

N = 8
$\Sigma(x)$ = 36
$\Sigma(x^2)$ = 204
$\Sigma(\log y)$ = 23.14745
$\Sigma(x \cdot \log y)$ = 105.44203

The two normal equations to be solved are of the form

$\Sigma(\log y) = Na + b\Sigma(x)$

$\Sigma(x \cdot \log y) = a\Sigma(x) + b\Sigma(x^2)$.

Substituting the given values we have

23.14745 = 8a + 36b
105.44203 = 36a + 204b.

Solving for the constants,

a = 2.75645
b = .03044.

The equation to the desired curve is, therefore,
$\log y = 2.75645 + .03044x$
with origin at 1921.

In fitting this curve by the method of least squares, as 
is done above, we satisfy the condition that the sum of 
the squares of the logarithmic deviations shall be a minimum. 
That is, the deviations to which this condition relates are 
the differences between the logarithms of the observed 
values and the logarithms of the corresponding trend values. 
This curve, it should be noted, is not the same as that
from which the sum of the squares of the arithmetic (natural)
deviations is a minimum.

The substitution in the above equation of the value of
x representing any given year will enable the logarithm
of the trend or normal value to be calculated. The trend
value in natural numbers may then be determined. In
Table 72 the normal value for each of the years covered
is given, together with the percentage relation of actual to
normal.

Table 72

Trend of Petroleum Production in the United States, 1922-1929,
with Comparison of Actual and Trend Values
(Straight line trend determined from logarithms of production figures)

Year     x     y (actual)       log y_c       y_c (computed)     Percentage
               Production       Log of        Trend value        relation of
               (in millions     trend         (in millions       actual to
               of bbls.)                      of bbls.)          trend

1922     1        557.5         2.78689          612.2              91.1
1923     2        732.4         2.81733          656.6             111.5
1924     3        713.9         2.84777          704.3             101.4
1925     4        763.7         2.87821          755.5             101.1
1926     5        770.9         2.90865          810.3              95.1
1927     6        901.1         2.93909          869.1             103.7
1928     7        901.5         2.96953          932.2              96.7
1929     8      1,007.3         2.99997          999.9             100.7


The points representing the actual production, together 
with the line of trend, are plotted in Fig. 59. The graph 
of the derived equation gives a good representation of 
the trend in the present instance. 
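
The whole of the foregoing computation, from the logarithms of Table 71 to the percentages of Table 72, may be sketched in Python as follows. The sketch assumes common logarithms computed by math.log10 rather than read from a five-place table, so that the last decimals may differ slightly from the printed figures; the variable names are illustrative.

```python
import math

# Petroleum production, millions of barrels, 1922-1929 (Table 71).
years = list(range(1922, 1930))
production = [557.5, 732.4, 713.9, 763.7, 770.9, 901.1, 901.5, 1007.3]

x = [year - 1921 for year in years]           # origin at 1921
log_y = [math.log10(value) for value in production]

n = len(x)
sum_x = sum(x)
sum_x2 = sum(xi * xi for xi in x)
sum_log_y = sum(log_y)
sum_x_log_y = sum(xi * li for xi, li in zip(x, log_y))

# Normal equations for log y = a + bx.
b = (n * sum_x_log_y - sum_x * sum_log_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_log_y - b * sum_x) / n

# Trend values in natural numbers, and actual as a percentage of trend.
trend = [10 ** (a + b * xi) for xi in x]
percent = [100 * actual / fitted for actual, fitted in zip(production, trend)]

for year, fitted, ratio in zip(years, trend, percent):
    print(year, round(fitted, 1), round(ratio, 1))
```

It will be observed that the deviations minimized here are those of the logarithms, not of the natural numbers.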

An equation of this type,¹ defining a linear trend in the
logarithms of the dependent variable, has certain dis-
tinctive advantages. The reader will note that this is the
logarithmic form of an equation to a compound interest
curve (an exponential curve). This equation was given
in Chapter II as

$y = p(1 + r)^x$

or

$\log y = \log p + x \log (1 + r)$.

In the example just given we have used the symbol a for
log p and the symbol b for log (1 + r), but the equations
are identical.

Fig. 59. Production of Petroleum in the United States, 1922-1929,
with Line Defining Average Rate of Growth


We may readily change to natural numbers the constants
in the equation defining the trend of petroleum production
from 1922 to 1929. We have

$\log y = 2.75645 + .03044x$

where 2.75645 is log p and .03044 is log (1 + r). The
natural number corresponding to 2.75645 is 570.8; the
natural number corresponding to .03044 is 1.0726. The
trend of petroleum production in natural form is, therefore,

$y = 570.8(1.0726)^x$

with origin at 1921.

Subtracting 1 from the constant 1.0726 we secure .0726,
which is r, the rate of increase of a series growing in accord-
ance with the compound interest law. (If, on subtracting
1, we have a negative value, the growth is negative, of
course.) This measure indicates that the production of
crude petroleum increased at an average rate of 7.26 per
cent a year between 1922 and 1929 (r being multiplied by
100 to place it on a percentage basis).
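
The passage from the logarithmic constants to p and r is a matter of taking antilogarithms. A short Python sketch follows, assuming common (base 10) logarithms as in the tables above; the variable names are illustrative.

```python
# Constants of the fitted line log y = a + bx (origin at 1921).
log_p = 2.75645            # a, the logarithm of the trend value at the origin
log_growth = 0.03044       # b, the logarithm of (1 + r)

p = 10 ** log_p                    # trend value at the origin, natural numbers
growth_factor = 10 ** log_growth   # 1 + r
r = growth_factor - 1              # average annual rate of growth

print(f"p = {p:.1f}")
print(f"1 + r = {growth_factor:.4f}")
print(f"rate = {100 * r:.2f} per cent a year")
```

A negative value of log (1 + r) would yield a growth factor below unity and hence a negative r, that is, an average rate of decline.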

When the trend of a series in time may be defined by a
straight line on ratio paper, and it is surprising how widely
applicable such a function is, the constant r is a highly
useful measure. It defines the average annual rate of
growth or decline of the series. It is, of course, an abstract
measure and thus has the great merit of permitting com-
parison of the trends of series relating to widely different
original units. The rate of growth of population, over a
given period, may have been 1.4 per cent per year; the
production of gasoline may have increased at a rate of
4.5 per cent, the production of automobiles at 4.2 per
cent, the production of wheat at 1.1 per cent, total national
income at 1.6 per cent, total national debt at 3.2 per cent.
The trends of these series are immediately comparable, and
conclusions concerning the direction and character of a na-
tion’s development may be drawn. This measure provides a
valuable device for the study of social and economic change.

¹ In any extensive application of this procedure time and labor may be

Fig. 60. Production of Crude Petroleum in the United States,
1918-1936, with Line of Trend
1918-1936, witli Line of Trend 
Table 73

Production of Crude Petroleum in the United States, 1918-1936
Comparison of Actual and Trend Values
(Trend values determined from second degree parabola fitted to logarithms of
production figures)

Year      y (actual)       y (computed)      Percentage
          Production       Trend value       relation of
          (in millions     (in millions      actual to
          of bbls.)        of bbls.)         trend

1918        335.9             345.0             97.4
1919        378.4             395.5             95.7
1920        442.9             449.2             98.6
1921        472.2             505.2             93.5
1922        557.5             562.8             99.1
1923        732.4             620.8            118.0
1924        713.9             678.2            105.3
1925        763.7             733.8            104.1
1926        770.9             786.3             98.0
1927        901.1             834.3            108.0
1928        901.5             876.8            102.8
1929      1,007.3             912.4            110.4
1930        898.0             940.5             95.5
1931        850.3             960.0             88.6
1932        785.2             970.4             80.9
1933        905.7             971.5             93.2
1934        908.1             963.2             94.3
1935        996.6             945.7            105.4
1936      1,098.5             919.6            119.5


By the use of additional terms a function of the type just
discussed may be modified, when dealing with a series
marked by non-linear trends on ratio paper. For example, if
the course of petroleum production be followed over a longer
period, as is done in Fig. 60, it is obvious that the trend
line secured for the period 1922-1929 is inappropriate. The
addition of a third constant gives an equation of the type

$\log y = a + bx + cx^2$.

(Footnote 1 continued from page 268.)

saved by utilizing Glover's mean value table (cf. James W. Glover, Tables of
Applied Mathematics, George Wahr, Ann Arbor, Michigan, 1923, 468 ff.).
By the use of this table the compound interest curve may be fitted directly
to the natural numbers. All necessary computations are simply and quickly
performed.
In fitting this to the data of petroleum production for
the period 1918-1936, we may follow the exact procedure
used when y was the dependent variable in a similar equa-
tion (see page 261), except that log y is used throughout,
instead of y. For the required equation we have

$\log y = 2.921331 + .023660x - .002107x^2$

with origin at 1927. This is shown graphically in Fig. 60.
Actual and trend values, in natural terms, are given in
Table 73 on page 269.
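
Trend values in natural terms follow from this equation on taking the antilogarithm of the computed log y. The Python sketch below evaluates the equation for any year; the function name petroleum_trend is an illustrative assumption, and the constants are those printed above.

```python
def petroleum_trend(year):
    """Trend value (millions of barrels) from the fitted log-parabola."""
    x = year - 1927                      # origin at 1927
    log_y = 2.921331 + 0.023660 * x - 0.002107 * x ** 2
    return 10 ** log_y                   # antilogarithm, base 10

for year in (1918, 1927, 1936):
    print(year, round(petroleum_trend(year), 1))
```

Because the coefficient of the squared term is negative, the curve bends downward on ratio paper, which accords with the declining trend values shown for the closing years in Table 73.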

Other Curve Types 

The two families of curves described in the preceding 
sections meet most of the needs of the economic statistician. 
The trend in most time series may be described by curves 
of the power series, fitted either to natural numbers or 
to the logarithms of the data (that is, to the logarithms
of the y values; time, the x-variable, is treated in terms of
natural numbers in fitting both the above types of curves).
These classes constitute flexible and widely applicable curve
forms.* Attention may be called to several other curve
types which have been applied less extensively to time
series, but with favorable results in particular cases.

Curves of the ordinary parabolic type ($y = ax^b$) are not

* There are available for fitting higher degree curves of the power series 
methods that lessen the labor involved, particularly if curves of different degree 
are to be fitted to the same data. These methods, which reduce the fitting 
process to a series of simple adding machine operations, are appropriate to 
extended research projects. Their use is not advisable, however, unless work 
involving a considerable number of routine operations is contemplated. It is 
desirable that the student master the basic least squares procedures outlined 
in the preceding pages, utilizing other methods only in case extended computing
tasks are undertaken.

For accounts of systematic methods suited to extensive calculations, see
R. A. Fisher, Statistical Methods for Research Workers, Edinburgh, Oliver
and Boyd, Sixth edition, 1936, 148-156; Max Sasuly, Trend Analysis of Statistics:
Theory and Technique, Washington, Brookings Institution, 1934. The
application of the method of orthogonal polynomials described by Fisher is
admirably exemplified in James W. Angell, The Behavior of Money, New York,
McGraw-Hill, 1936, 195-.
generally applicable to economic data in the form of time
series, as their use involves the treatment of the time 
variable as a geometric series. Such a curve, it will be 
recalled, becomes a straight line on double logarithmic 
paper. Yet if a curve of this form serves accurately to 
describe the trend of a given series, its use is justified, 
empirically. 

Such curves may be fitted most readily by employing
logarithms and using an equation of the linear type. The
equation

$y = ax^b$

becomes, in logarithmic form,

$\log y = \log a + b \log x$.

The two normal equations needed in fitting such a curve
are of the form

$\Sigma(\log y) = n \log a + b\Sigma(\log x)$

$\Sigma(\log x \cdot \log y) = \log a \, \Sigma(\log x) + b\Sigma(\log x)^2$.

By substituting the values computed from the data, these
equations may be solved for log a and b, just as in fitting
an ordinary straight line.¹
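
The fitting of a curve of this type may be sketched in Python as below. The observations used are hypothetical, generated to lie exactly on $y = 3x^{1.5}$ so that the result can be checked at sight; the function name fit_power_curve is likewise an assumption made for the example.

```python
import math

def fit_power_curve(xs, ys):
    """Fit y = a * x**b by least squares on the logarithms of x and y."""
    log_x = [math.log10(x) for x in xs]
    log_y = [math.log10(y) for y in ys]
    n = len(xs)
    sum_lx = sum(log_x)
    sum_ly = sum(log_y)
    sum_lxly = sum(u * v for u, v in zip(log_x, log_y))
    sum_lx2 = sum(u * u for u in log_x)
    # Normal equations of a straight line in log x and log y.
    b = (n * sum_lxly - sum_lx * sum_ly) / (n * sum_lx2 - sum_lx ** 2)
    log_a = (sum_ly - b * sum_lx) / n
    return 10 ** log_a, b

# Hypothetical observations on y = 3 * x**1.5 (x in geometric series).
xs = [1, 2, 4, 8, 16]
ys = [3 * x ** 1.5 for x in xs]

a, b = fit_power_curve(xs, ys)
print(round(a, 4), round(b, 4))
```

All values of x and y must be positive, since the logarithm of zero or of a negative number is not defined.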

The equation to the simple exponential curve may be
written

$y = ar^x$.

(The r in this equation is the equivalent of 1 + r, as given
on p. 267.) This equation may be used to define the trend
of a series increasing or decreasing in geometric progression.
It has been observed that the trends of economic series
frequently depart from such a geometric progression by
constant magnitudes. By adding this magnitude, in a
given case, to the original series (or subtracting it), a
modified series with a clear exponential trend may be
secured. The trend of the original series may be written

$y = K + ar^x$

where K is the constant magnitude by which the series
departs from a geometric progression. A modified exponential
curve of this type may give a highly satisfactory representa-
tion of trend, in certain cases. The method employed in
fitting such a curve is discussed in Appendix D.

¹ A useful table of the sums of the logarithms of the natural numbers from
1 to 100 is included as an appendix to Medical Biometry and Statistics, by
Raymond Pearl, Philadelphia, Saunders, 1923.

Some use has been made, in the interpretation of eco-
nomic statistics, of the Gompertz curve, the equation to
which was originally developed in the actuarial field. The
equation is

$y = ab^{c^x}$.

Its use in the analysis of economic statistics has been based 
upon the argument that there is a general law of growth 
characteristic of population increase, and that this same 
type of growth is found in industries whose products are a 
direct function of the growth of population. 

A somewhat similar curve of growth, the “logistic,” has
been employed by Verhulst and more recently by Raymond
Pearl and Lowell J. Reed in forecasting population growth.
This curve has been found to describe the trends of cer-
tain economic series. Examples of the procedures employed
in fitting Gompertz and logistic curves are given in Appen-
dix D.

The Determination of Monthly Trend Values 

The procedures so far described have dealt with annual
measurements only. Having fitted a line or curve to annual
data it is frequently necessary to make a transition to
monthly units. Problems involving such monthly measure-
ments are faced in the study of cyclical movements which
are discussed in the next chapter.

The constant a in the trend equation defines the trend
value in the year taken as origin. If the annual data
employed in the fitting processes are averages of twelve
monthly values (e.g., the average price of pig iron in 1937)
the constant a measures the trend value for a month cen-
tering at the middle of the year covered by the annual
figures. If the annual data are aggregates of twelve monthly
values (e.g., total production of pig iron in 1937) the constant
a must be divided by 12 to obtain the trend value for the
month centering at the middle of the year.

If the trend be linear, the constant b in the equation
y = a + bx defines the change due to trend over a twelve-
month period. In interpolating for monthly trend values,
the increment (or decrement) from month to month (e.g.,
from January to February of the year 1937) is b/12 if the
annual data employed in the fitting process are averages
of monthly values. The increment from month to month is
b/144 if the annual data are aggregates of monthly values.

The one further step needed is properly to center the
monthly trend values. These should, of course, be centered
at points of time corresponding to those to which the
actual monthly data relate. In averaging, or aggregating,
monthly data relating to the middle of each of the twelve
months in a calendar year we secure a figure centering at
July 1. The month centering at the middle of the year
of origin thus centers at July 1. For comparison with actual
monthly data, we desire trend values centering at July 15,
August 15, etc. At the beginning, therefore, we must add
to the trend value for the month centering at the middle
of the year of origin (that is, to a or to a/12) one-half of the
month-to-month increment (or decrement) that we have ob-
tained from b of the trend equation. This procedure gives us
the trend value for the month centering at July 15. This value
may be compared with the actual value recorded for that
month. The addition to this of the month-to-month trend
increment (or decrement) gives trend values for all following
months; subtraction gives trend values for all preceding
months.¹
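
The steps of this interpolation, for a linear trend, may be set out in Python as follows. The sketch assumes monthly data relating to the middle of each month, as in the text; the function name monthly_trend and the constants in the example are hypothetical.

```python
def monthly_trend(a, b, aggregates, months_from_july):
    """Trend value for a month of the origin year or of neighboring years.

    a, b             constants of the annual trend equation y = a + bx
    aggregates       True if the annual figures are totals of twelve monthly
                     values, False if they are averages of monthly values
    months_from_july 0 for July of the origin year, 1 for August,
                     -1 for June, and so on
    """
    if aggregates:
        level = a / 12.0     # monthly level at the middle of the origin year
        step = b / 144.0     # month-to-month increment
    else:
        level = a
        step = b / 12.0
    # The annual figure centers at July 1; move half a month to July 15.
    july_15 = level + step / 2.0
    return july_15 + step * months_from_july

# Hypothetical annual averages: a = 120, b = 24 (a rise of 2 a month).
print(monthly_trend(120, 24, False, 0))    # July of the origin year
print(monthly_trend(120, 24, False, 1))    # August
print(monthly_trend(120, 24, False, -1))   # June
```

If the same hypothetical series were recorded as annual totals, the constants would be twelve times as great (a = 1,440, b = 288), and the function with aggregates set to True should return the same monthly values.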

The Selection of a Curve to Represent Trend

Various types of curves which may be fitted to represent 
the trend of economic data over a period of time have been 
described. But which of these many types is to be selected 
in a given case? Which will give the best standard of 
normality for each of the years covered? Several references 
to this problem have been made in the preceding sections, 
but no general principles have been laid down. And, in 
fact, no general principles can be evoked to answer this 
fundamental question. There is no absolute test of goodness 
of fit in such cases. It is largely a matter of personal judg- 
ment as to the type of curve which best represents the 
trend in a given instance, and experience must play a 
dominant part in such judgments. But certain general 
considerations are of assistance in selecting the appropriate 
type of curve. 

1. The first step in the selection of a curve type is the 
plotting of the data. When this has been done, it is fre- 
quently possible by inspection to determine the appropriate 
form. The data may be plotted in four different combina- 
tions, of which the first two are of chief importance in 
dealing with economic material. 

a. Natural x, natural y. (That is, plot the given figures on ordi-
nary arithmetic paper.)

b. Natural x, log y. (Plot the x’s on the natural scale, and plot
the y’s on the logarithmic scale; i.e., use semi-logarithmic
paper.)

c. Natural y, log x. (Plot on semi-logarithmic paper, with the
x-scale logarithmic.)

d. Log y, log x. (Plot on paper with logarithmic ruling on both
scales.)

¹ If the original monthly data relate to the first or last of the month, rather
than the middle, a similar correction is needed, but the monthly dates named
in the text would be different, of course. If the trend equation is non-linear,
the process of interpolation must be correspondingly modified. For a discus-
sion of appropriate procedures the reader is referred to any treatise dealing
with the general principles of interpolation. The Calculus of Observations,
by Whittaker and Robinson, contains an excellent treatment of this topic.

If in any of these cases a straight line trend is secured, 
a type of equation which plots as a straight line under the 
given conditions (cf. Chapter II) would be selected. If a 
linear equation is not appropriate some other simple type 
may be suggested by the plotted data. In studying such 
graphs for the purpose of selecting a curve to represent 
trend, one should be familiar with the curves representing 
all the simpler equations. 

2. The appropriate curve may be determined by a study
of the relations between the two variables, x and y. In
the simpler cases the following relations hold:¹

a. If, when the values of x are arranged in an arithmetic series,
the corresponding values of y form a geometric series, the
relation is of the exponential type, described by the equa-
tion

$y = ab^x$.

b. If, when the values of x are arranged in a geometric series, the
corresponding values of y form a geometric series, the rela-
tion is of the simple parabolic or hyperbolic type, described
by the equation

$y = ax^b$.

c. If, when the values of x are arranged in an arithmetic series,
the first differences of the corresponding y’s are constant, the
relation is of the straight line type, described by the equa-
tion

$y = a + bx$.

¹ It will be recalled that an arithmetic series changes by a constant absolute
increment, while a geometric series changes by a constant percentage.

The differences between successive y values, when the x’s are arranged in an
arithmetic series, are termed “first differences” or “first order differences”
and are represented by the symbol $\Delta y$. The differences between successive
first differences are called “second differences” and are represented by the
symbol $\Delta^2 y$. Differences of higher order are similarly derived. The following
table illustrates the formation of differences:


d. If, when the values of x are arranged in an arithmetic series,
the nth differences of the corresponding y’s are constant, the
relation between the variables is described by an equation of
the potential series carried to the nth power of x; that is, by an
equation of the type

$y = a + bx + cx^2 + dx^3 + \cdots$, continued through the term in $x^n$.

Thus, in the example given above, in which the third differ-
ences are constant, the relation between x and y would be
described by an equation of the form

$y = a + bx + cx^2 + dx^3$.

When one is selecting a curve to use in the analysis of 
economic data, he will rarely, if ever, find these tests to 
be met perfectly. This would happen only when the curve 
chosen passed through all the plotted points. But data 
in a given case will generally approximate some one of 
the conditions described above, and the appropriate type 
of curve will be indicated.
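The difference and ratio tests described under this head may be sketched in a few lines of Python. This is a minimal illustration only: the function names and the three sample series are assumptions made for the example and are not data from the text. The sketch forms successive differences of y for equally spaced values of x and reports the order at which the differences become constant.

```python
def differences(values):
    """Return the first differences: each item minus the item before it."""
    return [b - a for a, b in zip(values, values[1:])]


def constant_difference_order(y, tol=1e-9):
    """Return the lowest order n at which the nth differences of y are constant.

    Assumes the x values form an arithmetic series (equal spacing).
    Returns None if no order up to len(y) - 2 gives constant differences.
    """
    current = list(y)
    for order in range(1, len(y) - 1):
        current = differences(current)
        if max(current) - min(current) <= tol:
            return order
    return None


def is_geometric(y, tol=1e-9):
    """True if successive ratios of y are constant (a geometric series)."""
    ratios = [b / a for a, b in zip(y, y[1:])]
    return max(ratios) - min(ratios) <= tol


# Hypothetical series, each taken at x = 0, 1, 2, ..., 6.
cubic = [2 + 3 * x - x**2 + 2 * x**3 for x in range(7)]  # third degree
line = [5 + 4 * x for x in range(7)]                     # straight line
exponential = [3 * 2**x for x in range(7)]               # y = a * b**x

print(constant_difference_order(cubic))   # order of constant differences
print(constant_difference_order(line))
print(is_geometric(exponential))
```

With observed data the differences will never be exactly constant, so in practice the tolerance `tol` would have to be loosened, or the differences simply inspected.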

3. If the study of the original data does not render a 
definite decision possible, several types of curves may be 
fitted to the data and the decision made by comparing 
the results. If the equations to the curves being compared 
contain the same number of constants, a comparison of 
the root-mean-square deviations about the curves furnishes 
a conclusive and valid test of the closeness of the fit within 
the limits of the data.
The root-mean-square deviation may be readily computed
by making use of the following relationship

    $\Sigma(d^{2}) = \Sigma(y^{2}) - a\Sigma(y) - b\Sigma(xy) - c\Sigma(x^{2}y) - \cdots$

where $\Sigma(d^{2})$ is the sum of the squares of the deviations
about the line of trend. (The derivation of this equation is 
explained in Appendix A, in which a generalized form is 
given.) If the equations do not contain an equal number 
of constants, a test of this sort is invalid and the comparison 
can only be made by inspection. Personal judgment as 
to the curve which represents the trend most accurately 
must be the basis of the decision in such cases. 
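The relationship quoted above may be checked numerically. The sketch below is an illustration under stated assumptions: it fits a straight line by least squares to the annual averages of freight car loadings given in Table 75, numbers the years 1 to 10, and compares the sum of squared deviations obtained from the short formula with the sum obtained point by point. The function names are hypothetical. The short formula holds only when a and b are the least-squares constants.

```python
import math


def fit_line(x, y):
    """Return the least-squares constants (a, b) of the line y = a + bx."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b


def sum_sq_direct(x, y, a, b):
    """Sum of squared deviations about the line, computed point by point."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


def sum_sq_shortcut(x, y, a, b):
    """Sum of squared deviations from sum(y^2) - a*sum(y) - b*sum(xy)."""
    sum_y2 = sum(yi * yi for yi in y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    return sum_y2 - a * sum(y) - b * sum_xy


x = list(range(1, 11))                                      # years 1918-1927
y = [857, 805, 868, 757, 829, 956, 932, 983, 1018, 990]     # Table 75 averages
a, b = fit_line(x, y)

# Root-mean-square deviation about the fitted line.
rms = math.sqrt(sum_sq_shortcut(x, y, a, b) / len(y))
print(round(a, 2), round(b, 3), round(rms, 1))
```

Two curves with the same number of constants could be compared by computing `rms` for each in this way and preferring the smaller value, within the limits of the data.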

It should be remembered that the closeness of fit within 
the limits of the data is not of itself a final criterion. An 
equation could be secured, having a number of constants 
equal to the number of points, which would give a curve 
passing through every point plotted, yet such a curve 
would not necessarily represent the trend. The concept 
of a trend is of a regular, smooth underlying movement, 
from which there are deviations, but which marks the long- 
time tendency of the series. In general, therefore, the curve 
should be of simple form, if it is to be consistent with the 
concept of secular trend. This does not mean, however, 
that a complex trend can be represented by a simple curve 
which fails to conform to the plotted data. 

4. An important question to be answered before the 
form of curve can be selected relates to the limits within 
which the line of trend is to be used. If it is to be used only 
within the limits of the plotted data (i.e., for interpolation) 
one set of considerations governs the choice of a curve. If 
it is to be projected beyond the limits of the data, used as 
a basis for the determination of normal during a subsequent 
period, other considerations enter. In the former case a 
reasonable fit to the data is the sole requirement; in the 
latter case it is necessary, in addition, that the trend of the 
projection be logical, and consistent with the past record.

The fact should be clearly recognized that projection, or
extrapolation, represents a guess, justified only on the
assumption that a proper line of trend has been fitted and
that the same conditions that affected the series in the 
past will prevail in the future. A change in conditions, the 
introduction of new elements, renders the projection invalid. 
When dealing with economic statistics, moreover, it is 
ordinarily impossible to tell, except in retrospect, when a 
change has taken place. Conclusions drawn from the pro- 
jection of a line of trend are always subject to error, therefore. 
In practical statistical work such projections are made, 
and are justified on the ground that the most probable 
course in the future is that which prevailed in the past. 
Projections into the distant future are, of course, subject 
to wider margins of error than short-time projections. 
Lines of trend should be revised from time to time, there- 
fore, as new data become available. 

When a projection is to be made, a simple curve with few 
constants is to be preferred to a more complicated one. A 
third or fourth degree parabola may give an excellent fit 
to the data in a given case, but the projection of such curves 
is inadvisable. It is well to remember, as Perrin has pointed 
out, that a curve suitable for interpolation may not be at 
all adapted to extrapolation. 

The avoidance of distortion of trend lines by abnormal 
conditions in the terminal years of the period studied is 
particularly important when a trend is to be projected. 
Reference is made to this point in the next chapter. 

It seems to be true, in general, that simple curves fitted 
to the logarithms of the y's give more reliable results when
projected than curves fitted to the natural numbers. In
an interesting discussion of this point, Karl G. Karsten¹
argues that phenomena characterized by a uniform rate
of change are more likely to maintain their trend than
phenomena marked by a uniform amount of change. It is
the semi-logarithmic curves, of course, which best measure
rates of change.

¹ Karl G. Karsten, Charts and Graphs, New York, Prentice-Hall, 1923, 423-426.
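Fitting a straight line to the logarithms of the y's amounts to fitting the exponential curve y = ab^x, in which b measures the uniform rate of change. The sketch below illustrates the idea under stated assumptions: the series is hypothetical (a quantity growing 5 per cent a year from 200), and the function names are invented for the example.

```python
import math


def fit_line(x, y):
    """Return the least-squares constants (a, b) of the line y = a + bx."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b


def fit_exponential(x, y):
    """Fit y = a * b**x by fitting a straight line to log10(y)."""
    log_a, log_b = fit_line(x, [math.log10(v) for v in y])
    return 10 ** log_a, 10 ** log_b


# Hypothetical series: 200 units in year 0, rising 5 per cent a year.
x = list(range(10))
y = [200 * 1.05 ** t for t in x]

a, b = fit_exponential(x, y)
projected = a * b ** 12          # projection three years beyond the data
print(round(a, 2), round(b, 4), round(projected, 1))
```

The projection carries the rate of change forward, not the amount of change, which is the property Karsten's argument turns upon.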

5. It is frequently true that no one curve will fit a given 
series during the entire period it is desired to study. This 
may be due to changes in conditions which cause the trend 
to be altered. Thus the trend of wholesale prices was 
downward, in a direction well represented by a straight 
line, from the close of the Civil War to 1896. From 1896 
to the beginning of the World War the trend was upward, 
and could be described by a second degree parabola. From 
1921 to 1929 the trend was also curvilinear, rising to 1925, 
declining thereafter. Similar changes occur in many eco- 
nomic series. By breaking the entire period up into sections, 
appropriate lines of trend may be fitted to the several 
periods thus marked off. This process may be carried to 
a quite illogical extreme, however. The concept of trend 
is of a gradual, long-term change, and the breaking up 
of a series in order to fit a number of trend lines is contrary 
to the whole conception. It may be justified upon occasion 
when a real change in conditions occurs, but in all cases 
the attempt should be made to represent the trend during 
the whole period by a single line. 

Deflation as a Step in Analysis 

Many series of economic data are expressed in monetary 
units, in dollars, pounds, or francs. Such series are subject 
to distortion because of changes in the price level. Thus 
the value of heavy engineering contracts awarded in the 
United States in 1913 amounted to approximately 601 
millions of dollars; in 1929 the value of engineering contracts 
awarded in the same territory amounted to approximately 
3,950 millions of dollars.¹ Was the volume of engineering
construction in 1929 over six times that in 1913? It was
not. The value of construction contracts awarded in any
year depends not only upon the actual volume of construction
but also upon the costs of construction materials and
labor, and these costs increased substantially from 1913
to 1929. If we wish to measure the change in the volume
of construction alone, these values must be corrected for
the increase in construction costs between 1913 and 1929.
Such a process is termed deflation.²

¹ Figures compiled by Engineering News Record.

The selection of an appropriate deflating index is the
central problem in such cases. For the present purpose we
may use an index of construction costs, based upon the
prices of steel, cement, and lumber, and upon wage rates
in construction industries, compiled by the Engineering
News Record. This index shows that construction costs
in 1929 were approximately 107 per cent higher than in
1913 (the index is 100 for 1913, 207 for 1929). Dividing
the 1929 aggregate by 2.07, to correct for the change in
costs, we secure a deflated total of 1,908 millions of dollars.
This may be taken to measure the aggregate value of
engineering contracts awarded in 1929, when the 1913
dollar is used as a standard of value. (In this process the
value of money is assumed to be held constant with reference
to the year which is the base of the price or cost
index used as a deflator.) If the deflating index may be
accepted as an accurate measure of changing costs, the
deflated series may be assumed to define changes in the
actual volume of engineering construction. The effects of
changing prices and wages will have been eliminated.

² The term deflation is not inappropriate when correction is being made for
a rise in prices; it is less suitable when correction is made for a fall in
prices. The period selected as a standard of reference may be one in which
[...] index resting on such a year as base will raise values relating to other
periods. The word deflation [...] somewhat technical sense; we must understand
it to mean correction for [...] in the value of the dollar (as measured by
specific [...]

The general procedure is illustrated in greater detail in
Table 74. Actual and deflated series are plotted
in Fig. 61. The degree to which changing monetary values
distorted the construction series may be readily appreciated
from the diagram.

Table 74

Actual and Deflated Values of Contracts Awarded in Engineering
Construction, 1913-1936

        Contracts awarded,                       Deflated value of
        engineering construction   Index of      contracts awarded
Year    (monthly average, in       construction  (monthly average, in
        thousands of dollars)*     costs*        thousands of dollars)

1913           50,117                1.000             50,117
1914           48,574                 .886             54,824
1915           48,740                 .926             52,635
1916           77,778                1.296             60,014
1917           61,592                1.812             33,991
1918           82,729                1.892             43,726
1919           97,991                1.984             49,391
1920          126,923                2.513             50,507
1921           99,459                2.018             49,286
1922          129,716                1.745             74,336
1923          158,670                2.141             74,110
1924          166,593                2.154             77,341
1925          213,287                2.067            103,187
1926          237,820                2.080            114,337
1927          271,147                2.062            131,497
1928          298,215                2.068            144,205
1929          329,193                2.070            159,030
1930          264,438                2.029            130,329
1931          202,693                1.814            111,738
1932          101,609                1.570             64,719
1933           89,031                1.702             52,310
1934          113,383                1.981             57,235
1935          132,513                1.952             67,886
1936          198,904                2.065             96,322

* Data on contracts awarded have been compiled by the Engineering News
Record; the index of construction costs has been computed by the same agency.

Fig. 61. Comparison of Actual and Deflated Values of Contracts
Awarded in Engineering Construction, 1913-1936
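The arithmetic of deflation is a simple division of each value by the index for the same year, with the index expressed as a ratio to the base year. The sketch below reproduces three rows of Table 74 and the 1929 aggregate quoted in the text; the function name `deflate` is an assumption made for the illustration.

```python
def deflate(values, index):
    """Divide each value by the corresponding index (base year = 1.000)."""
    return [v / i for v, i in zip(values, index)]


# Three rows taken from Table 74 (monthly averages, thousands of dollars).
years = [1913, 1914, 1929]
contracts = [50117, 48574, 329193]
cost_index = [1.000, 0.886, 2.070]

deflated = [round(v) for v in deflate(contracts, cost_index)]
for year, value in zip(years, deflated):
    print(year, value)

# The 1929 aggregate of 3,950 millions of dollars, in 1913 dollars.
aggregate_1929 = round(3950 / 2.07)
print(aggregate_1929)
```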

Most value series are affected by price changes, and it is 
generally advisable to correct for this factor before further 
analysis is attempted. Each case presents a new problem, 
for no general deflating index is suitable to all series. The 
index of wholesale prices compiled by the United States 
Bureau of Labor Statistics has been used extensively in 
deflating economic data expressed in dollar values, but this 
index is not at all appropriate in many of the cases in 
which it has been employed. It is absurd, for instance, to 
deflate money wages by an index of wholesale prices. The 
deflating index employed should be a measure of price 
changes as they affect the series being deflated. 

The deflation of a value series is in general a first step 
in the study of that series. The way is then open for further 
analysis by methods explained in the present and succeeding 
chapters.

CHAPTER VIII


THE ANALYSIS OF TIME SERIES: MEASUREMENT 
OF SEASONAL AND CYCLICAL FLUCTUATIONS 

The measurement of secular trend is but one of the 
problems connected with the analysis of a series in time. 
Such series, it has been pointed out, are subject to periodic 
fluctuations, seasonal and cyclical in character, and these 
fluctuations are generally more important in their effects 
upon business than is the long-time trend. Our present 
concern is with methods of isolating such periodic variations. 
The series in Table 75, which clearly reflects the seasonal
and cyclical swings of domestic trade in the United States, 
may be used to illustrate methods of measuring these 
movements. 

Table 75

Average Weekly Freight Car Loadings in the United States, 1918-1927*

(Unit: 1,000 cars)

Month        1918   1919   1920   1921   1922   1923   1924   1925   1926   1927

January       655    728    820    706    696    848    859    891    920    944
February      753    687    776    685    757    854    908    906    932    956
March         842    696    848    691    818    916    916    926    960    998
April         873    721    730    706    716    941    874    932    966    969
May           897    759    862    760    776    975    895    971  1,018  1,004
June          918    796    896    762    831  1,012    906    992  1,052  1,021
July          970    887    901    750    813    985    881    975  1,037    978
August        962    892    969    810    853  1,042    969  1,073  1,106  1,073
September     956    960    967    842    925  1,037  1,037  1,074  1,140  1,093
October       925    967  1,005    932    978  1,070  1,091  1,107  1,184  1,101
November      819    807    884    764    957    964    976  1,024  1,042    926
December      719    758    755    681    832    827    869    925    858    814

Average       857    805    868    757    829    956    932    983  1,018    990

* Data from the Annual Bulletin of the American Railway Association and
the Survey of Current Business. The published figures have been slightly
revised to take account of calendar variations.

For the present purpose the study of seasonal and cyclical 
variations in freight car loadings is limited to the period 
1918-1927. The disturbances of the ensuing period, com- 
bined with changes in railroad operating methods and busi- 
ness practices, materially modified the behavior of this series. 
The demonstration of statistical procedure will be clearer 
if restricted to the relatively homogeneous period here cov- 
ered. 

The Measurement of Seasonal Fluctuations: Moving Averages

Moving averages provide a useful method of defining 
seasonal variations. Since these fluctuations take place 
within a constant period of twelve months, a moving average 
may be used with more confidence than when a cycle of 
varying length is involved. The magnitude of the fluctua- 
tions (the amplitude of the seasonal swings) will not ordinarily 
be constant, hence the line marked out by the moving 
averages will not be completely free of seasonal influences. 
The relation of the actual monthly items to the moving 
averages may be averaged, however, and the indices of 
seasonal variation based upon these averages. 

It is essential, of course, that the moving average, cen- 
tered, fall at the same date as the original figure with 
which it is to be compared. This involves a second process 
of averaging. For example, the weekly averages of freight 
car loadings relate to the middle of each month. The 
average of the twelve monthly items for 1918, when centered, 
falls on July 1st. The average of the items from February, 
1918, through January, 1919, centered, falls on August 1st. 
To secure a figure comparable with the July 15th average, 
these two must be averaged. By this process of computing
a two-month moving average from the twelve-month average,
comparability with the original figures is secured.
Table 76 presents averages obtained in this way for the
period from July, 1918, to June, 1927.

Table 76

Moving Averages of Freight Car Loadings, 1918-1927
(12-month moving average, centered, adjusted by 2-month moving average,
centered)

(Unit: 1,000 cars)

Month       1918     1919     1920     1921     1922     1923     1924     1925     1926     1927

Jan.                808.0    850.8    809.6    783.7    915.8    935.9    957.3  1,004.8  1,019.1
Feb.                801.7    854.6    796.7    788.1    930.9    928.5    965.6  1,008.7  1,015.3
Mar.                798.9    858.1    784.9    793.4    943.4    925.5    971.5  1,012.8  1,012.0
Apr.                800.8    860.0    776.6    798.8    951.9    926.4    973.7  1,018.8  1,006.5
May                 802.1    864.8    768.6    808.7    956.0    927.8    976.3  1,022.8    998.3
June                803.2    867.9    760.5    823.0    956.1    930.0    980.7  1,020.7    991.6
July       860.5    808.7    863.0    757.0    835.7    956.4    933.1    984.2  1,018.9
Aug.       860.8    816.2    854.5    759.6    846.0    959.1    934.3    986.5  1,020.9
Sept.      851.9    826.3    844.1    767.9    854.2    961.3    934.7    989.0  1,023.5
Oct.       839.5    833.0    836.6    773.6    867.6    958.5    937.5    991.8  1,025.2
Nov.       827.4    837.6    831.3    774.7    885.3    952.4    943.1    995.2  1,024.8
Dec.       816.6    846.1    821.5    778.2    901.1    944.7    949.8    999.7  1,022.9
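The double averaging described in this section (a twelve-month moving average, then a two-month moving average of the results to center them) may be sketched as follows. The function name and the layout of the output are assumptions made for the illustration; the data are the first fourteen monthly items of Table 75, which suffice to give the centered averages for July and August, 1918.

```python
def centered_moving_average(series, period=12):
    """Centered moving average for an even period.

    A moving average of an even number of items falls midway between two
    items, so successive pairs of averages are averaged again (a two-term
    moving average) to bring each result into line with an original item.
    Returns a list aligned with `series`; positions for which no centered
    average can be computed hold None.
    """
    half = period // 2
    means = [sum(series[i:i + period]) / period
             for i in range(len(series) - period + 1)]
    out = [None] * len(series)
    for i in range(len(means) - 1):
        out[i + half] = (means[i] + means[i + 1]) / 2
    return out


# Table 75: January, 1918, through February, 1919.
loadings = [655, 753, 842, 873, 897, 918, 970, 962, 956, 925, 819, 719,
            728, 687]

cma = centered_moving_average(loadings)
july_1918, august_1918 = cma[6], cma[7]
print(round(july_1918, 2), round(august_1918, 2))
```

The two values agree, to one decimal, with the entries for July and August, 1918, in Table 76.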



The original data are now expressed as percentages of 
the corresponding moving averages. These percentages 
are given in Table 77. 


Table 77

Percentage Relation of Actual Freight Car Loadings to 12-Month
Moving Averages

Month      1918    1919    1920    1921    1922    1923    1924    1925    1926    1927

Jan.               90.1    96.4    87.2    88.8    92.6    91.8    93.1    91.6    92.6
Feb.               85.7    90.8    86.0    96.1    91.7    97.8    93.8    92.4    94.2
Mar.               87.1    98.8    88.0   103.1    97.1    99.0    95.3    94.8    98.6
Apr.               90.0    84.9    90.9    89.6    98.9    94.3    95.7    94.8    96.3
May                94.6    99.7    98.9    96.0   102.0    96.5    99.5    99.5   100.6
June               99.1   103.2   100.2   101.0   105.8    97.4   101.2   103.1   103.0
July      112.7   109.7   104.4    99.1    97.3   103.0    94.4    99.1   101.8
Aug.      111.8   109.3   113.4   106.6   100.8   108.6   103.7   108.8   108.3
Sept.     112.2   116.2   114.6   109.6   108.3   107.9   110.9   108.6   111.4
Oct.      110.2   116.1   120.1   120.5   112.7   111.6   116.4   111.6   115.5
Nov.       99.0    96.3   106.3    98.6   108.1   101.2   103.5   102.9   101.7
Dec.       88.0    89.6    91.9    87.5    92.3    87.5    91.5    92.5    83.9




These percentages show some variation from year to 
year in the relation of the figures for a given month to 
the moving average. Thus the January figures, while 
always below the average, vary from 87.2 per cent to 
96.4 per cent of the average. The nine percentages secured 
for each month must be averaged to obtain the index
desired. Either the arithmetic average or the median may
be employed for this purpose. The results secured by
applying the two methods are shown in Table 78. In
columns (2) and (3) the actual arithmetic means and
medians are given. The average of the twelve arithmetic
means happens to be exactly 100, so no further adjustment
is needed. Usually the average will depart in some degree
from 100, as it does for the medians. When this is the
case the twelve monthly index numbers must be adjusted
to make their average equal to 100. The items in column
(4) have been secured from the items in column (3) by
dividing throughout by 1.00367.
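The two steps just described, averaging the percentages for a month and then adjusting the twelve results so that they average 100, may be sketched as follows. The January percentages are those of Table 77 and the twelve medians are those of column (3) of Table 78; the function name `adjust_to_100` is an assumption made for the illustration.

```python
from statistics import mean, median

# The nine January percentages, 1919-1927, from Table 77.
january = [90.1, 96.4, 87.2, 88.8, 92.6, 91.8, 93.1, 91.6, 92.6]
jan_mean = round(mean(january), 1)
jan_median = median(january)


def adjust_to_100(indices):
    """Scale a set of index numbers so that their average equals 100."""
    factor = mean(indices) / 100
    return [v / factor for v in indices]


# Unadjusted medians for January through December (Table 78, column 3).
medians = [91.8, 92.4, 97.1, 94.3, 99.5, 101.2,
           101.8, 108.6, 110.9, 115.5, 101.7, 89.6]

adjusted = [round(v, 1) for v in adjust_to_100(medians)]
print(jan_mean, jan_median)
print(round(mean(medians), 3))
print(adjusted)
```

Dividing by `mean(indices) / 100` is the same operation as the division by 1.00367 described in the text, since the twelve medians average 100.367.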

Table 78

Indices of Seasonal Variation in Freight Car Loadings, Computed
from Moving Averages

   (1)          (2)            (3)             (4)
             Arithmetic      Medians         Medians
  Month        means       (unadjusted)     (adjusted)

January         91.6           91.8            91.5
February        92.1           92.4            92.1
March           95.8           97.1            96.7
April           92.8           94.3            94.0
May             98.6           99.5            99.1
June           101.6          101.2           100.8
July           102.4          101.8           101.4
August         107.9          108.6           108.2
September      111.1          110.9           110.5
October        115.0          115.5           115.1
November       102.0          101.7           101.3
December        89.4           89.6            89.3

Average        100.0          100.367         100.0

Index Numbers of Seasonal Variation: Averaging Ratios to Trend

Another method of securing seasonal indices, one having
distinctive advantages, involves the use of ratios
to trend.¹ In the application of this
method, a suitable line of trend, linear or non-linear, is
fitted to the data, the actual monthly items are expressed
as percentages of the corresponding trend figures, and then,
for each month, an average of the percentage ratios of
the actual to the trend values is secured. This procedure
is identical with that described in connection with the
use of moving averages, except that the actual values may
be expressed as percentages of normal values derived from
any function employed to represent trend. In the selection
of an average value for each month, use may be made of
a multiple frequency table in obtaining an understanding
of the nature of the actual seasonal movement. With the
help of such a table the existence of a definite seasonal
movement may be verified and the type of average to be
used in securing a typical value for each month may be
determined. (It would, of course, be equally appropriate
to use a table of this type in connection with the method
of moving averages.) We shall apply this method to the
data employed in the preceding examples.

A straight line, fitted to annual averages of the data of
freight car loadings from 1918 to 1927, as given in Table 75,
is described by the equation

    $y = 769.00 + 23.727x$

with origin at July 1, 1917. Normal values for each month
may be computed readily.² The normal value for the
month centering at July 1, 1917, is 769.00 (i.e., the constant
a of the trend equation). Since the increment over a twelve-
month period is 23.727, the increment from month to
month is one twelfth of this, or 1.977. Hence the normal
value for the month centering at January 1, 1918, is 769.00
+ (6 × 1.977), or 780.862. But the average weekly freight
car loadings for January, 1918, must be taken to center
at January 15th. The monthly trend value centering at
that date is 780.862 + ½(1.977), or 781.850. The trend
value for February, 1918, is secured by adding to 781.850
the monthly increment, 1.977. A similar process gives
the value for each succeeding month. The results, rounded
off to the nearest whole number, are given in column (2)
of Table 80.

¹ [...] method were worked out independently by Helen D.
Falkner, “The Measurement of Seasonal Variation,” Journal of the American
Statistical Association, June, 1924, 167-179, and Lincoln W. Hall, “Seasonal
Variation as a Relative of Secular Trend,” Journal of the American Statistical
Association, June, 1924, 156-166.

² Methods used in the determination of monthly trend values are discussed
in Chapter VII.

Fig. 62. Frequency Distributions: Monthly Freight Car Loadings
Expressed as Relatives of Corresponding Trend Values
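The computation of monthly trend values from the annual trend equation may be sketched as follows. The constants are those of the equation given above; the names `trend_value` and `MONTHLY_INCREMENT` are assumptions made for the illustration, and the monthly increment is rounded to three decimals so that the figures agree with those worked out in the text.

```python
ORIGIN_VALUE = 769.00       # constant a: trend value at the origin, July 1, 1917
ANNUAL_INCREMENT = 23.727   # constant b: increment over a twelve-month period
MONTHLY_INCREMENT = round(ANNUAL_INCREMENT / 12, 3)   # 1.977, as in the text


def trend_value(months_from_origin):
    """Trend value at a date measured in months from July 1, 1917."""
    return ORIGIN_VALUE + MONTHLY_INCREMENT * months_from_origin


jan_1_1918 = trend_value(6)       # six months after the origin
jan_15_1918 = trend_value(6.5)    # the January item centers at mid-month

# Trend values for the twelve months of 1918, rounded to whole numbers.
trend_1918 = [round(trend_value(6.5 + m)) for m in range(12)]
print(round(jan_1_1918, 3), round(jan_15_1918, 3))
print(trend_1918)
```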

Expressing each of the given values for each month as a per- 
centage of the corresponding trend value, we secure ten 
such relative figures (since the data cover ten years). The 
ten January percentages vary from 79.4 to 98.9, the ten 
October percentages from 107.0 to 119.7, etc. The multiple 
frequency table which appears in Fig. 62 is constructed
by classifying, in the form of a frequency distribution, the
items for each month.

The presence of a distinct seasonal variation is dem- 
onstrated by this table. Freight traffic is consistently 
low in the winter months. Activity is somewhat greater 
in the spring, and reaches a peak as a result of harvesting 
and other demands in the late summer and fall. 

The tabular summary facilitates the selection of a type 
of average for the measurement of the seasonal movements. 
The median is likely to be unrepresentative; it is subject 
to material change in value by the addition or withdrawal 
of one or two entries, unless there is a definite concentration 
in the monthly frequency distributions. The arithmetic 
mean of all the items, on the other hand, may be unduly 
affected by exceptional cases. An alternative method is 
provided by the possibility of taking the arithmetic mean 
of the central items for each month. If an inspection of 
the multiple frequency table does not lead to an immediate 
decision as to which is the best type of average to employ 
in a given case, several index numbers may be worked 
out for each month, and a decision reached after a compari- 
son of the results. (Indeed, since the determination of a 
typical value is a separate problem for each month, the 
method of averaging employed might vary from month 
to month.) In the present instance the seasonal variation 
is fairly regular, year after year. No great differences 
would appear in the results secured by averaging varying 
numbers of items. Index numbers based upon averages 
of the four central items for each of the twelve months 
are appropriate in this case. (In general, an average of
three, four, or five central values is more likely to be stable 
and representative than either the median or an average 
of all the items for each month. The greater the concentra- 
tion in the monthly frequency tables, the smaller the number 
of items upon which the index numbers may be based.) 

The simple averages of the four central items constitute
the unadjusted index numbers given in Table 79. Correcting
these so that the average of each group is equal to 100,
we secure the adjusted index numbers presented in the
same table. (These averages have been derived, not from
the frequency distributions shown in Fig. 62, but from
individual percentages defining the relation of actual to
trend values.)

Table 79

Indices of Seasonal Variation in Freight Car Loadings, Based
upon Percentage Ratios of Actual Values to Linear
Trend Values

               Unadjusted            Adjusted
             index numbers         index numbers
  Month     (based upon four      (based upon four
             central items)        central items)

January           92.9                 [...]
February          94.8                 [...]
March             98.6                 [...]
April             94.3                 [...]
May              100.2                 [...]
June             102.4                 [...]
July             102.8                 [...]
August           109.7                 [...]
September        112.3                 [...]
October          115.6                 [...]
November         103.6                 [...]
December          89.9                 [...]

Average          101.425               [...]
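The averaging of central items, and the subsequent correction to an average of 100, may be sketched as follows. This is an illustration under stated assumptions: the ten January ratios are recomputed here from the January items of Table 75 and from the trend constants given earlier (781.850 for January, 1918, with an annual increment of 23.727), and the function name `central_mean` is hypothetical. The adjusted figures printed by the sketch are computed values, not a transcription of the adjusted column of Table 79.

```python
from statistics import mean


def central_mean(values, k):
    """Arithmetic mean of the k central items of the sorted values.

    Assumes len(values) - k is even, so that the same number of items
    is dropped from each end of the array.
    """
    ordered = sorted(values)
    drop = (len(ordered) - k) // 2
    return mean(ordered[drop:drop + k])


# January loadings, 1918-1927 (Table 75), as percentages of trend.
january_actual = [655, 728, 820, 706, 696, 848, 859, 891, 920, 944]
january_ratios = [100 * y / (781.850 + 23.727 * k)
                  for k, y in enumerate(january_actual)]
jan_index = central_mean(january_ratios, 4)

# Unadjusted index numbers, January through December (Table 79).
unadjusted = [92.9, 94.8, 98.6, 94.3, 100.2, 102.4,
              102.8, 109.7, 112.3, 115.6, 103.6, 89.9]
factor = mean(unadjusted) / 100
adjusted = [round(v / factor, 1) for v in unadjusted]

print(round(jan_index, 1))
print(round(mean(unadjusted), 3))
print(adjusted)
```

The January result agrees with the first entry of Table 79, and the adjusted January figure agrees with the seasonal index of .916 used for January in the discussion of cyclical fluctuations.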


The index numbers of seasonal variation derived from 
ratios to trend accord very closely with those computed 
from moving averages. The widest discrepancy, for the 
month of February, amounts to only 1.4. The consistency 
of the seasonal movement in freight car loadings helps 
to explain this close agreement. In general the two methods 
here exemplified will yield results that are fairly close 
together. Both are well adapted to the measurement of 
seasonal changes in homogeneous series. Simpler methods 
may be used on occasion, and more involved methods may
be required in dealing with non-homogeneous series where
there is reason to suspect that the pattern of seasonal 
movements has been modified during the period under 
observation. 

Modifications of these general procedures are necessary 
when the pattern of the seasonal movements in a given 
series is altered during the period under observation. Two 
types of shifts in seasonal variation may be distinguished. 
The first includes shifts that are irregular over time, but 
that are related to definable causal factors. Thus the price 
of an agricultural product may follow one seasonal pattern 
in years of high production, and quite a different pattern 
in years of low production.¹ Where this condition prevails
it may be possible to compute two sets of seasonal indices,
each to be applied under appropriate conditions. Methods
already described may be used in the construction of such 
indices. Of this irregular type, also, are alterations in 
the seasonal pattern of an economic series that reflect sharp 
changes in business practice. Shifts in the dates of the 
annual automobile shows in the United States have mater- 
ially altered the seasonal index of automobile sales. 

The second type of seasonal modification is progressive 
in character. The change in pattern is not sudden, nor 
does it reverse itself. Slow alterations over time in trade 
practices and consumption habits bring such evolutionary 
or secular changes. The slow displacement of the open 
car by the closed car brought such a progressive modification 
in the seasonal pattern of automobile sales. In the computa- 
tion of seasonal indices under these conditions persistent 
changes over time in the figures for each month may be 
measured separately. Thus, when ratios to trend have 
been obtained, all the January items (such as those plotted
in Fig. 62) may be plotted chronologically. The progressive
change in the January relatives from 1920 to 1937, say,
is then defined by a line of secular trend. The trend value
for January of 1920 is a first approximation to the January
seasonal index for 1920. The figure for February of 1920
is obtained in the same way, and so for each month of
1920. Adjustment of these preliminary values to make
their average equal to 100 gives a set of seasonal indices
for 1920. Seasonal indices for other years are computed
in the same way.

¹ See F. L. Thompson, Agricultural Prices, New York, McGraw-Hill, 1930.
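The procedure for a progressively changing seasonal pattern may be sketched as follows. The sketch is illustrative only: the relatives are hypothetical figures constructed so that January drifts upward and July downward by half a point a year, and the function names are assumptions. For each month a straight line is fitted to that month's relatives over the years; the twelve trend values for a given year are then adjusted so that they average 100.

```python
from statistics import mean


def fit_line(x, y):
    """Return the least-squares constants (a, b) of the line y = a + bx."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b


def progressive_seasonal_indices(relatives_by_month):
    """Return one list of twelve adjusted seasonal indices for each year.

    relatives_by_month holds twelve lists; each list gives the ratios to
    trend for one calendar month in successive years.
    """
    n_years = len(relatives_by_month[0])
    years = list(range(n_years))
    lines = [fit_line(years, r) for r in relatives_by_month]
    result = []
    for t in years:
        preliminary = [a + b * t for a, b in lines]   # first approximations
        factor = mean(preliminary) / 100
        result.append([p / factor for p in preliminary])
    return result


# Hypothetical relatives for five years: January rises, July falls.
base = [92, 93, 97, 94, 99, 101, 102, 108, 111, 115, 102, 89]
slope = [0.5, 0, 0, 0, 0, 0, -0.5, 0, 0, 0, 0, 0]
relatives = [[base[m] + slope[m] * t for t in range(5)] for m in range(12)]

indices = progressive_seasonal_indices(relatives)
print([round(v, 1) for v in indices[0]])
print([round(v, 1) for v in indices[4]])
```

With observed relatives the fitted slopes would be subject to sampling error, and the slope for each month should be tested before the shift is accepted as real.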

This method is, of course, more laborious than the procedure
followed when the seasonal pattern remains constant.
Before applying the more complicated method the
investigator should assure himself that the shift in pattern
is real, and not merely a reflection of accidental variations.²

The Measurement of Cyclical Fluctuations 

There remains the task of combining the corrections for 
secular trend and seasonal variation in order to secure 
measures of cyclical changes in a given series. Major 
interest in most economic studies attaches to these cyclical 
changes, and the measurement of such changes is usually 
the central problem in the analysis of time series. The 
complete elimination of all non-cyclical movements is impos- 
sible, of course. We must content ourselves with measures 
reflecting cyclical changes intermingled in rather uncertain
proportions with accidental fluctuations. 

The procedure may be illustrated with reference to the 
data of freight car loadings in the United States, presented 
in Table 75. For the purposes of the present illustration 
the study will be restricted to the decade 1918-1927. The
severe disturbances that occurred during the business cycle
that ran its course between 1927 and 1933, and in the years
immediately following, greatly complicate the task of
disentangling the secular, seasonal, and cyclical elements in the
behavior of this series. Not until a somewhat longer period
has intervened will it be possible to determine the
contributions a changing secular trend and changing seasonal
movements may have made to the fluctuations in railway freight
traffic during the decade 1927-1937.

² Tests of sampling errors are discussed in Chapters XIV, XV, and XVIII.
The test of a linear trend in this case would relate to the slope b of the line
fitted to the relatives for a given month.

The literature on the measurement of seasonal fluctuations is extensive.
The references at the close of this chapter contain detailed accounts of
modifications of the basic procedures discussed above. A rapid, [...]
and accurate graphic method, suitable for use by the student who has [...]
the essentials of the formal procedures, is explained in the article by
William A. Spurr. Spurr's method utilizes relative (logarithmic) deviations,
a procedure for which there is strong logical justification.

In attempting to separate the results of secular, seasonal, 
cyclical, and random movements in the behavior of time 
series, it is well to establish a series of “expected” values, 
representing results of the operation of regularly acting 
forces. Most regular and predictable of the forces affecting 
time series are those defined as secular and seasonal. The 
equation to the line of secular trend of freight car loadings 
provides a means of estimating annual and monthly values. 
These would be the “expected” values were the forces of 
trend alone in operation. But we know that a seasonal 
movement, regular enough for fairly exact measurement, is 
superimposed upon the trend. The combination of the 
results of these two forces provides a basic series of “expected 
values,” from which deviations due to the play of other 
forces may conveniently be measured. 

A process suitable to this purpose is illustrated in Table 80. 
In col. (2) we have the monthly trend values of freight 
car loadings, and in col. (3) index numbers of seasonal 
variation. The products of the two, constituting the series 
of “expected values,” are given in col. (4). Thus, for Janu- 
ary, 1918, the expected number of freight cars loaded is 
not 782, the trend value, but 782 × .916, the latter figure 
being the seasonal index for January. This correction 
gives an “expected” number of 716. Subtracting from the 
actual values in col. (5) the corresponding expected values, 
we obtain the measurements in col. (6). The 655 cars 
loaded in January, 1918, fell short by 61 of the “expected” 
Table 80 

Illustrating the Analysis of a Series in Time 
Freight Car Loadings, 1918-1927 

(1)         (2)         (3)         (4)         (5)         (6)           (7)
Year        Trend       Seasonal    Trend       Actual      Deviation of  Percentage
and         value       index       corrected   value       actual value  deviation of
month                   (as ratio)  for                     from ‘trend   actual value
                                    seasonal                corrected     from ‘trend
                                                            for           corrected
                                                            seasonal’     for seasonal’

            T           S           TS          A           A - TS        (A - TS)/TS

1918
Jan.          782        .916         716         655         -61           -8.5
Feb.          784        .935         733         753         +20           +2.7
Mar.          786        .972         764         842         +78          +10.2
Apr.          788        .930         733         873        +140          +19.1
May           790        .988         781         897        +116          +14.9
June          792       1.010         800         918        +118          +14.8
July          794       1.013         804         970        +166          +20.6
Aug.          796       1.082         861         962        +101          +11.7
Sept.         798       1.107         883         956         +73           +8.3
Oct.          800       1.140         912         925         +13           +1.4
Nov.          802       1.021         819         819           0              0
Dec.          804        .886         712         719          +7          +0.98

1919
Jan.          806        .916         738         728         -10           -1.4
Feb.          808        .935         755         687         -68           -9.0
Mar.          810        .972         787         696         -91          -11.6
Apr.          812        .930         755         721         -34           -4.5
May           813        .988         803         759         -44           -5.5
June          815       1.010         823         796         -27           -3.3
July          817       1.013         828         887         +59           +7.1
Aug.          819       1.082         886         892          +6           +0.7
Sept.         821       1.107         909         960         +51           +5.6
Oct.          823       1.140         938         967         +29           +3.1
Nov.          825       1.021         842         807         -35           -4.2
Dec.          827        .886         733         758         +25           +3.4

1920
Jan.          829        .916         759         820         +61           +8.0
Feb.          831        .935         777         776          -1           -0.1
Mar.          833        .972         810         848         +38           +4.7
Apr.          835        .930         777         730         -47           -6.0
May           837        .988         827         862         +35           +4.2
June          839       1.010         847         896         +49           +5.8
Table 80, Continued 

(1)         (2)         (3)         (4)         (5)         (6)           (7)
            T           S           TS          A           A - TS        (A - TS)/TS

1920
July          841       1.013         852         901         +49           +5.8
Aug.          843       1.082         912         969         +57           +6.3
Sept.         845       1.107         935         967         +32           +3.4
Oct.          847       1.140         966       1,005         +39           +4.0
Nov.          849       1.021         867         884         +17           +2.0
Dec.          851        .886         754         755          +1           +0.1

1921
Jan.          853        .916         781         706         -75           -9.6
Feb.          855        .935         799         685        -114          -14.3
Mar.          857        .972         833         691        -142          -17.0
Apr.          859        .930         799         706         -93          -11.6
May           861        .988         851         760         -91          -10.7
June          863       1.010         872         762        -110          -12.6
July          865       1.013         876         750        -126          -14.4
Aug.          867       1.082         938         810        -128          -13.6
Sept.         869       1.107         962         842        -120          -12.5
Oct.          871       1.140         993         932         -61           -6.1
Nov.          873       1.021         891         764        -127          -14.3
Dec.          875        .886         775         681         -94          -12.1

1922
Jan.          877        .916         803         696        -107          -13.3
Feb.          879        .935         822         757         -65           -7.9
Mar.          881        .972         856         818         -38           -4.4
Apr.          883        .930         821         716        -105          -12.8
May           885        .988         874         776         -98          -11.2
June          887       1.010         896         831         -65           -7.3
July          889       1.013         901         813         -88           -9.8
Aug.          891       1.082         964         853        -111          -11.5
Sept.         893       1.107         989         925         -64           -6.5
Oct.          895       1.140       1,020         978         -42           -4.1
Nov.          897       1.021         916         957         +41           +4.5
Dec.          899        .886         797         832         +35           +4.4

1923
Jan.          900        .916         824         848         +24           +2.9
Feb.          902        .935         843         854         +11           +1.3
Mar.          904        .972         879         916         +37           +4.2
Apr.          906        .930         843         941         +98          +11.6
May           908        .988         897         975         +78           +8.7
Table 80, Continued 

(1)         (2)         (3)         (4)         (5)         (6)           (7)
            T           S           TS          A           A - TS        (A - TS)/TS

1923
June          910       1.010         919       1,012         +93          +10.1
July          912       1.013         924         985         +61           +6.6
Aug.          914       1.082         989       1,042         +53           +5.4
Sept.         916       1.107       1,014       1,037         +23           +2.3
Oct.          918       1.140       1,047       1,070         +23           +2.2
Nov.          920       1.021         939         964         +25           +2.7
Dec.          922        .886         817         827         +10           +1.2

1924
Jan.          924        .916         846         859         +13           +1.5
Feb.          926        .935         866         908         +42           +4.8
Mar.          928        .972         902         916         +14           +1.6
Apr.          930        .930         865         874          +9           +1.0
May           932        .988         921         895         -26           -2.8
June          934       1.010         943         906         -37           -3.9
July          936       1.013         948         881         -67           -7.1
Aug.          938       1.082       1,015         969         -46           -4.5
Sept.         940       1.107       1,041       1,037          -4           -0.4
Oct.          942       1.140       1,074       1,091         +17           +1.6
Nov.          944       1.021         964         976         +12           +1.2
Dec.          946        .886         838         869         +31           +3.7

1925
Jan.          948        .916         868         891         +23           +2.6
Feb.          950        .935         888         906         +18           +2.0
Mar.          952        .972         925         926          +1           +0.1
Apr.          954        .930         887         932         +45           +5.1
May           956        .988         945         971         +26           +2.8
June          958       1.010         968         992         +24           +2.5
July          960       1.013         972         975          +3           +0.3
Aug.          962       1.082       1,041       1,073         +32           +3.1
Sept.         964       1.107       1,067       1,074          +7           +0.7
Oct.          966       1.140       1,101       1,107          +6           +0.5
Nov.          968       1.021         988       1,024         +36           +3.6
Dec.          970        .886         859         925         +66           +7.7

1926
Jan.          972        .916         890         920         +30           +3.4
Feb.          974        .935         911         932         +21           +2.3
Mar.          976        .972         949         960         +11           +1.2
Apr.          978        .930         910         966         +56           +6.2
May           980        .988         968       1,018         +50           +5.2
Table 80, Continued 

(1)         (2)         (3)         (4)         (5)         (6)           (7)
            T           S           TS          A           A - TS        (A - TS)/TS

1927
Jan.          995        .916         911         944         +33           +3.6
Feb.          997        .935         932         956         +24           +2.6
Mar.          999        .972         971         998         +27           +2.8
Apr.        1,001        .930         931         969         +38           +4.1
May         1,003        .988         991       1,004         +13           +1.3
June        1,005       1.010       1,015       1,021          +6           +0.6
July        1,007       1.013       1,020         978         -42           -4.1
Aug.        1,009       1.082       1,092       1,073         -19           -1.7
Sept.       1,011       1.107       1,119       1,093         -26           -2.3
Oct.        1,013       1.140       1,155       1,101         -54           -4.7
Nov.        1,015       1.021       1,036         926        -110          -10.6
Dec.        1,017        .886         901         814         -87           -9.7

number, 716. Such deviations of actual values from “trend 
corrected for seasonal” represent the combined influence 
of cyclical and accidental factors. These may be utilized 
in the absolute form given in col. (6), or may be expressed 
in percentage terms as in col. (7) of Table 80. 
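
The arithmetic behind cols. (4), (6), and (7) of Table 80 may be sketched in a few lines of Python. This is an illustration only, not the procedure as the author carried it out: the function name `expected_and_deviation` is our own, and the rounding of the expected value to a whole number before the deviation is taken is an assumption inferred from the printed figures.

```python
def expected_and_deviation(trend, seasonal, actual):
    """Return (TS, A - TS, percentage deviation) for one month.

    Assumption: TS is rounded to a whole number before differencing,
    which appears to match the entries printed in Table 80.
    """
    ts = round(trend * seasonal)        # col. (4): trend corrected for seasonal
    dev = actual - ts                   # col. (6): A - TS
    pct = round(100 * dev / ts, 1)      # col. (7): (A - TS) / TS, in per cent
    return ts, dev, pct


# First three months of 1918, taken from Table 80 as (T, S, A)
rows = {
    "Jan. 1918": (782, 0.916, 655),
    "Feb. 1918": (784, 0.935, 753),
    "Mar. 1918": (786, 0.972, 842),
}

results = {month: expected_and_deviation(*vals) for month, vals in rows.items()}
for month, (ts, dev, pct) in results.items():
    print(f"{month}: TS={ts}, A-TS={dev:+d}, pct={pct:+.1f}")
```

The same function may be applied row by row to the remaining months of the table.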

The series defining trend values corrected for seasonal 
variations, which are given in cols. (6) and (7) of Table 80, 
furnish the most satisfactory bases from which cycles in 
economic series may be measured. It is true that the 
“cycles” in cols. (6) and (7) are distorted by accidental 
fluctuations, but there is no simple means by which these 
may be eliminated. Recognizing their presence, the series 
may be put to fruitful use in the study of cyclical movements.^ 

^ A series of corrected deviations from trend may be secured by subtracting 
the indices of seasonal variation from a series in which actual values are 

The analysis of this series may be followed through 
graphically in Figs. 63 and 64 on page 300. The actual data of 
freight car loadings, by months from 1918 to 1927, are plotted 
in Fig. 63, together with a straight line of trend. In 
addition, a series of expected values (the figures in col. [4] 
of Table 80) is given for comparison with the actual. In 
this chart the seasonal pattern, shown by the dotted line, is 
superimposed upon the trend. Fig. 64 shows the deviations 
of actual from expected values, in percentage terms. These 
constitute the “cycles” in freight car loadings. As we have 
noted, random elements as well as cyclical fluctuations proper 
are present in these deviations. It would be possible, by 
using three- or five-month moving averages on these deviations, 
or by other smoothing processes, to eliminate some of 
the effects of the accidental movements. But the random and 
the cyclical movements are so closely interwoven that the 
attempt at separation is not generally made. 
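
The smoothing mentioned here may be sketched as follows. The sketch assumes a simple centered moving average with an odd number of terms, the end months being dropped; the function name `centered_moving_average` is our own, and the five input values are the percentage deviations for January to May, 1918, from col. (7) of Table 80.

```python
def centered_moving_average(values, window=3):
    """Centered moving average with an odd window; end points are dropped."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    return [
        sum(values[i - half:i + half + 1]) / window
        for i in range(half, len(values) - half)
    ]


# Percentage deviations, Jan.-May 1918, from col. (7) of Table 80
pct_dev = [-8.5, 2.7, 10.2, 19.1, 14.9]

smoothed = centered_moving_average(pct_dev, window=3)
print([round(v, 2) for v in smoothed])
```

A longer window removes more of the accidental movement, but it also flattens the cyclical swings themselves, which is one reason the separation is seldom attempted.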

If cyclical changes in this series are to be compared with 
similar changes in other series, it is desirable to reduce the fig- 
ures to a form permitting such comparison. The percentage 
deviations might be much more violent in one series than in 
another, and without a common denominator comparison 
would be difficult. This common denominator is afforded by 
the standard deviation. The monthly or annual deviations 
may be expressed in terms of the standard deviation as the 
unit of measurement, if such comparison is to be made. 
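
A minimal sketch of this reduction to standard deviation units is given below. It rests on assumptions not stated in the text: the standard deviation is taken about the mean of the deviations and with divisor n (`statistics.pstdev`), and the second series is hypothetical, introduced only to show a less violent series placed on the same footing.

```python
from statistics import pstdev


def in_sd_units(deviations):
    """Express each deviation in units of the series' own standard deviation."""
    sigma = pstdev(deviations)  # assumption: population SD about the mean
    if sigma == 0:
        raise ValueError("series has no variation")
    return [d / sigma for d in deviations]


series_a = [-8.5, 2.7, 10.2, 19.1, 14.9]   # col. (7) of Table 80, Jan.-May 1918
series_b = [-1.0, 0.5, 1.5, 2.0, 1.0]      # hypothetical, less violent series

units_a = in_sd_units(series_a)
units_b = in_sd_units(series_b)
print([round(u, 2) for u in units_a])
print([round(u, 2) for u in units_b])
```

After the conversion both series have a standard deviation of one, so that their cyclical swings may be compared directly.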

(Footnote 1 continued from page 298.) 

given as percentages of corresponding trend values. That is, \(\frac{A}{T} - S\) may be 
employed, instead of \(\frac{A - TS}{TS}\). This usage, which involves the assumptions 
that the cyclical-accidental composite and seasonal variations both repre- 
sent deviations from trend as base and that their influences are additive, is 
not as strong, logically, as the method exemplified in the text. Trend and 
seasonal forces are the constant factors in the behavior of time series. In 
combination they may be thought of as providing the base from which cycli- 
cal and accidental movements occur, as deviations. (This is a convenient, and 
perhaps not a faulty, conception. We do not, however, possess knowledge of 
the true organic relations among the elements of time series.) 


[Figure: Cyclical and Accidental Fluctuations in Freight Car Loadings in the United States, 1918-1927. Horizontal axis: years, 1918 to 1927.] 

The process of analysis has now been completed. We 
have, for the given series, the equation to the line of secular 
trend, and from this the normal or trend value at any 
given date may be computed. The seasonal variations have 
been measured, and indices of these variations computed. 
Finally, the cyclical fluctuations (plus the unmeasurable 
random and accidental changes) have been isolated. These 
measurements of cyclical fluctuations may be used in 
studying the sequences of change in different economic 
series during business cycles, in comparing economic series 
in respect of the amplitude or duration of their cyclical 
movements, and in various other ways in the analysis of 
business cycles and the planning of business operations. 
Some of these applications are discussed in later sections. 

General Considerations 

Certain considerations not specifically mentioned above 
should be borne in mind in subjecting time series to the 
type of analysis described in this chapter. It is essential 
that the data employed be homogeneous, as regards sources, 
methods of quotation, coverage, etc. In addition, homo- 
geneity in the conditions underlying the behavior of the 
particular series which are the objects of study is assumed. 
Homogeneity, as the term is here used, may not be defined 
in absolute terms. New factors are constantly being inter- 
jected into economic and social life. Homogeneity cannot 
be taken to mean static conditions. Yet the change must 
be orderly and, as regards major movements, reasonably 
continuous if the kind of analysis here discussed is to yield 
results. Abrupt dislocations that suddenly alter prevailing 
trends and existing seasonal patterns break the necessary 
homogeneity of statistical series. If the forces that caused 
these dislocations persist, and operate in orderly fashion, 
we mark a break in our series and subject the new period 
to analysis in its turn. 

For the determination of a line of trend and the calculation 
of indices of seasonal variation, data extending over 
as long a period as possible should be employed (subject 
to the preceding qualification regarding fundamental discontinuities). 
Ten years may be suggested as a minimum 
period, though a much longer term of years is desirable. 
If interest attaches to cycles of long duration, rather than 
to the short-period business cycles with which the preceding 
account is concerned, our concept of trend, as well as that 
of cycles, must be modified. The minimum time period 
suitable for study must be correspondingly lengthened. 

If a relatively short term of years is employed in the 
determination of trend, it is important that the terminal 
years be neither exceptionally high nor exceptionally low, 
as a result of cyclical or accidental movements. In general, 
the cyclical movements in the terminal years should be 
in “symmetrical phases,” in Crum’s phrase. Thus a cyclical 
rise at the beginning of the period should be balanced by 
a cyclical decline at the end. 

It is logically improper to make correction for assumed 
seasonal movements in a time series unless the existence 
of true seasonal variations has been established. That is, 
a test should be applied to determine whether the observed 
departures of the various monthly indices from their aver- 
age value (100) are attributable to the play of chance, or 
whether a true seasonal pattern is present. The basis of 
such tests of significance is discussed in Chapters XIV 
and XVIII, and a method appropriate to the present problem 
is developed in Chapter XV. 

In fitting a line of trend, computing indices of seasonal 
variation and deriving, finally, a set of residual figures 
which are taken to reflect the cyclical fluctuations in an 
economic series we are, of course, abstracting from reality. 
As in all such abstractions, caution is necessary. Assumptions 
implicit in the various steps taken are likely to be 
forgotten. Thus the “cycles” plotted as deviations in Fig. 64 
are distorted not only by the random and irregular fluctuations 
to which attention has already been called. To the 
extent that the trend is inadequately or inaccurately defined 
by the particular function used, residual errors are present 
in the deviations. To the extent that seasonal movements 
are inaccurately measured by the seasonal indices employed, 
other residual errors are present. And if the trend is pro- 
jected beyond the period covered by the fitting process, 
or if seasonal indices are used for periods not included in 
their calculation, new sources of possible error are intro- 
duced. The “cycles” that appear so definite and clear-cut 
in our tables may contain more than traces of many non- 
cyclical elements. It is often desirable to employ methods 
of analysis that carry us far from the original observations, 
but the dangers of misinterpretation and error are multiplied 
as we abstract from the reality of economic processes and 
business operations. 

The methods of time series analysis described in this 
and the preceding chapter are adapted to a variety of 
economic and business purposes. But they do not constitute 
the only means of attack, in dealing with series ordered 
in time. Special problems may necessitate the use of some- 
what more elaborate procedures.^ For some purposes simpler 
methods will suffice. For other purposes it may be invalid 
to attempt to isolate and measure separately the influence 
of secular, seasonal, and cyclical forces. Economic science 
has yet to determine the precise nature of the interrelations 
among these categories of forces. In the light of this fact 
the discerning statistician will adapt his methods to the 
requirements of individual problems, as they arise. 
CHAPTER IX 


INDEX NUMBERS OF PHYSICAL VOLUME 

Comprehensive and accurate records of physical pro- 
duction are of central importance to business interests, to 
government, and to economists. The appraisal of the mar- 
ket and the intelligent planning of production programs 
require knowledge of past production trends and present 
conditions. The credit policies of banking authorities and 
monetary policies of federal agencies are determined in 
good part with reference to the physical volume of goods 
being produced and marketed. The phases of business 
cycles are probably traced with more accuracy by produc- 
tion movements than by changes in any other economic 
element. The directions in which the productive efforts 
of an economy are being exerted are defined by records of 
the output of goods of different classes, such as capital 
goods and consumption goods. Changes in the course of 
years in the true standard of living of a nation must be 
measured in terms of the aggregate of physical goods pro- 
duced. 

The last twenty years have witnessed notable enlarge- 
ments of the scope and improvements in the accuracy of 
measurements of production in the United States. Efforts 
of federal agencies, private organizations, and trade associa- 
tions have combined to provide materially better statistics 
of output in agriculture, mining, and manufacture. More 
recently records of the volume of trade have been broadened 
and made more accurate. There are important gaps still, 
particularly as regards the output of finished, highly fabri- 
cated goods not easily enumerated in units of constant 
quality. But the statistics we have provide a full and 
reasonably accurate record of monthly and annual movements 
of production. 

Here, again, we face the problem of combining series 
relating to individual commodities. For scattered data on 
the output of oats, coal, gasoline, pig iron, automobiles, 
etc. do not define the general changes in output that are 
of interest to persons concerned with the larger aspects 
of economic change. He who would study the course of 
general production encounters a problem much like that 
presented to the student of general price movements. If 
the general trend of production is to be determined, or 
if the cyclical or seasonal swings of production are to be 
studied, the mass of individual figures must be reduced to 
the form of a single index, the significance of which may 
be easily comprehended. The present chapter deals with 
methods appropriate to the construction of such indices. 

Index Numbers of Production Unadjusted for Trend 
and Seasonal Movements 

An immediate and obvious obstacle to the combination 
of measures of output for different industries arises from 
differences in the units employed. Since bushels, tons, and 
gallons may not be added directly, the simple aggregative 
type of index is ruled out. One method of overcoming this 
difficulty is to reduce to relative terms the several output 
series that are to be combined. A relative number measuring 
the output of petroleum in 1936 as a percentage of output 
in 1922 may be averaged with similar relatives for bituminous 
and anthracite coal. The average may be a simple one, 
or the relatives for the several commodities may be weighted 
in proportion to the importance of the commodities in 
question. This procedure was illustrated in detail in the 
opening pages of Chapter VI. 
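
The averaging of relatives may be sketched as below. All output figures and weights are hypothetical, chosen only to make the arithmetic easy to follow; the function name `average_of_relatives` is likewise our own.

```python
def average_of_relatives(base, given, weights=None):
    """Average of output relatives (given year as a percentage of base year).

    With weights=None a simple average is returned; otherwise each
    relative is weighted by the importance assigned to its commodity.
    """
    relatives = {name: 100 * given[name] / base[name] for name in base}
    if weights is None:
        weights = {name: 1 for name in base}
    total_weight = sum(weights[name] for name in base)
    return sum(weights[name] * relatives[name] for name in base) / total_weight


# Hypothetical outputs, each in its own physical unit
base_output = {"petroleum": 500, "bituminous coal": 400, "anthracite coal": 50}
given_output = {"petroleum": 1000, "bituminous coal": 440, "anthracite coal": 45}
importance = {"petroleum": 3, "bituminous coal": 5, "anthracite coal": 2}

simple_index = average_of_relatives(base_output, given_output)
weighted_index = average_of_relatives(base_output, given_output, importance)
print(round(simple_index, 1), round(weighted_index, 1))
```

Because each series is first reduced to a pure ratio, barrels and tons never have to be added to one another.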

An alternative method is to employ an index of the 
weighted aggregative type, keeping prices constant as 
between two periods being compared. In 1917, according 
to the computations of the Price Section of the War Industries 
Board, the total value of the output of 90 raw materials 
in the United States was 34,748 millions of dollars. This 
figure represents, of course, a value total of the type 
\(\Sigma(q_{1917}\,p_{1917})\), where \(q_{1917}\) represents the quantity of a given 
raw material produced in 1917 and \(p_{1917}\) represents the 
average price of that commodity in 1917. In 1918 both 
quantities and prices were different. If, however, we obtain 
another value aggregate using 1918 quantities and 1917 
prices we shall have a figure differing from that for 1917 
only in respect of the quantity factor. For the 90 raw materials 
in question this total, which is represented by the 
expression \(\Sigma(q_{1918}\,p_{1917})\), amounted to $35,169,000,000. The 
totals for 1918 and 1917 are comparable, being both in 
dollar units. The difference between them measures the 
change in physical production between the two years. As 
an index of this change we have 

\[ \frac{\Sigma(q_{1918}\,p_{1917})}{\Sigma(q_{1917}\,p_{1917})} = \frac{\$35,169}{\$34,748} = 1.012 \]

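
The aggregative quantity index may be sketched in Python as follows. The two value totals (in millions of dollars) are those quoted in the text; the two-commodity data and the function name `aggregative_quantity_index` are hypothetical and serve only to show how the totals would be built up from individual quantities and base-year prices.

```python
def aggregative_quantity_index(q_given, q_base, p_base):
    """Ratio of value aggregates in which only the quantities change."""
    numerator = sum(q * p for q, p in zip(q_given, p_base))
    denominator = sum(q * p for q, p in zip(q_base, p_base))
    return numerator / denominator


# Value aggregates quoted in the text, in millions of dollars
index_1918 = 35_169 / 34_748
print(round(index_1918, 3))

# Hypothetical two-commodity case: quantities in each year, base-year prices
q_base = [10, 20]
q_given = [12, 18]
p_base = [1.0, 2.0]
toy_index = aggregative_quantity_index(q_given, q_base, p_base)
print(round(toy_index, 2))
```

On the figures of the text the index stands a little above unity, that is, the physical output of the 90 raw materials rose by something over one per cent between 1917 and 1918.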


This index will be recognized as one of the aggregative 
types discussed in Chapter VI, except that the p’s and the 
q’s are interchanged. When information concerning both 
quantities produced and average per unit prices is available, 
these aggregative indices, or the “ideal” index which is 
a combination of two such aggregative measures, may be 
employed for the measurement of quantity changes as well 
as for price changes. The “ideal” index, when used for 
this purpose, takes the form 


\[ I = \sqrt{\frac{\Sigma(q_1 p_0)}{\Sigma(q_0 p_0)} \times \frac{\Sigma(q_1 p_1)}{\Sigma(q_0 p_1)}} \]

where \(q_0\) and \(p_0\) represent the quantities and prices of the 
individual commodities in the base year, while \(q_1\) and \(p_1\) 
represent quantities and prices in the given year. The 
procedure in the computation of such an index is identical 
with that employed in computing the “ideal” price index, 
with prices and quantities reversed. This formula may be 
modified, as was the corresponding price index, to 

\[ \frac{\Sigma (p_0 + p_1)\, q_1}{\Sigma (p_0 + p_1)\, q_0} \]

or to a form in which the p’s come from some intermediate 
year. In one form or another, the aggregative type of 
index is well adapted to the requirements of an index of 
physical volume.* 

[Fig. 65. Changes in the Physical Volume of Manufacturing Production in the United States, 1914-1935: All Commodities, Capital Goods, and Consumption Goods] 
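
The “ideal” quantity index and its modified form may be sketched as follows. The data are hypothetical, and the function names are our own. The modified form, in which the sum of base-year and given-year prices serves as the weight, is the one commonly associated with the names of Marshall and Edgeworth; that label is ours and is not used in the text.

```python
from math import sqrt


def value_total(quantities, prices):
    """Sum of quantity times price over all commodities."""
    return sum(q * p for q, p in zip(quantities, prices))


def ideal_quantity_index(q0, q1, p0, p1):
    """Geometric mean of the base-price and given-price aggregative indices."""
    at_base_prices = value_total(q1, p0) / value_total(q0, p0)
    at_given_prices = value_total(q1, p1) / value_total(q0, p1)
    return sqrt(at_base_prices * at_given_prices)


def modified_quantity_index(q0, q1, p0, p1):
    """Aggregative index weighted by the sum of the two years' prices."""
    combined = [a + b for a, b in zip(p0, p1)]
    return value_total(q1, combined) / value_total(q0, combined)


# Hypothetical data for two commodities
q0, q1 = [10, 20], [12, 18]
p0, p1 = [1.0, 2.0], [2.0, 1.0]

ideal = ideal_quantity_index(q0, q1, p0, p1)
modified = modified_quantity_index(q0, q1, p0, p1)
print(round(ideal, 4), round(modified, 4))
```

In this made-up case the two forms differ only slightly, which is the usual experience when prices and quantities have not moved violently between the years compared.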

* Since the price or value factor enters in the derivation of such an index, 
whether it be constructed from relative numbers or from value aggregates, 
no quantity index is completely divorced from pecuniary measurements. For 
a discussion of this point, and of other logical problems involved in the 
construction of index numbers of production, see Arthur F. Burns, “The 
Measurement of the Physical Volume of Production,” Quarterly Journal of Economics, 

The aggregative procedure lends itself readily to the construction 
of index numbers for commodity groups. This 
is desirable in the study of production movements, as it 
is for prices. The significant features of production changes 
over a given period may be far more clearly revealed by 
measurements of relative changes in the output of different 
classes of goods than by a general index of production. 

Changes in the volume of production of various classes 
of manufactured goods during the period 1914-1935 are 
indicated by the following measurements, constructed by 
the National Bureau of Economic Research. The basic 
data, which were compiled by the Census of Manufactures, 
provide the quantity and (by derivation) the unit price 
records required for the “ideal” formula. That formula, 
slightly modified for working purposes, was employed in 
the construction of these index numbers. 


Table 81 

Index Numbers of the Physical Volume of Production of 
Manufactured Goods in the United States, 1914-1935 * 

Year    All          Durable    Semi-      Perishable   Goods destined   Goods destined
        industries   goods      durable    goods        for human        for capital
                                goods                   consumption      equipment and
                                                                         construction

1914      100.0        100.0      100.0      100.0        100.0            100.0
1919      129.5        141.7      120.9      123.2        129.1            129.5
1921      104.5         99.6      104.6      108.9        109.4             91.8
1923      155.8        183.7      140.2      135.4        150.4            164.3
1925      159.5        185.2      141.8      144.4        154.0            167.7
1927      163.3        177.2      151.0      154.9        159.5            166.7
1929      183.7        210.9      162.5      170.9        177.7            192.0
1931      138.2        112.3      137.4      154.9        146.9            103.7
1933      128.0         91.4      140.1      144.4        142.6             81.3
1935      160.5        143.9      164.4      163.9        171.3            122.5


Selected measurements from Table 81 are shown graphi- 
cally in Fig. 65. 

* Constructed by the National Bureau of Economic Research, New York. 
See Economic Tendencies in the United States for a statement on procedure. 

Adjusted Index Numbers of Production 

In the analysis of time series we have seen that cyclical 
fluctuations are often the objects of primary interest. This 
is particularly true in the study of physical volume, for 
changes in the volume of production and trade are features 
of fundamental importance in business cycles. Methods 
have been explained, in the preceding chapters, by means 
of which we seek to measure the cyclical fluctuations in 
individual series (fluctuations which are inextricably entan- 
gled with accidental movements of major and minor degree). 
An obvious next step, in the study of general business 
conditions, is the combination of the cyclical-accidental 
movements in a number of series into a single index. The 
utility of such an index of changes in the physical volume 
of production in the course of the business cycle is evident. 

When annual data are employed the construction of an 
index of these cyclical changes is simple. No problem of 
seasonal variation enters, and secular trend alone has to 
be taken account of. Two different methods by which 
this may be done present themselves. Edmund E. Day, 
a pioneer in this field of economic research, has tested both 
methods. 

The first involves the fitting of an appropriate line of 
trend to each of the constituent series. The actual items 
are then expressed as percentages of the corresponding 
trend values. When this has been done for each series, 
the final adjusted index for a given year is obtained by 
taking a weighted average of these percentages for that 
year. Each commodity may be weighted in this averaging 
process, as in the calculation of the unadjusted index. 
The resulting adjusted index is in terms of relatives, but 
these relatives refer to a hypothetical “normal,” instead 
of to any fixed base. This is the desired index of cyclical- 
accidental changes in the physical volume of production. 
With monthly data the process is the same, except that, 
before being averaged, the deviations from trend are corrected 
to eliminate the influence of recurrent seasonal 
movements. 
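
The first method, applied to annual data, may be sketched as below. Several assumptions are made for the sake of illustration: a straight line fitted by least squares stands in for “an appropriate line of trend,” the two output series and their weights are hypothetical, and the function names are our own.

```python
def linear_trend(values):
    """Least-squares straight line fitted to equally spaced observations."""
    n = len(values)
    x_bar = (n - 1) / 2
    y_bar = sum(values) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(values))
    sxx = sum((x - x_bar) ** 2 for x in range(n))
    slope = sxy / sxx
    return [y_bar + slope * (x - x_bar) for x in range(n)]


def adjusted_index(series, weights):
    """Weighted average, year by year, of actual items as percentages of trend."""
    pct_of_trend = {
        name: [100 * a / t for a, t in zip(vals, linear_trend(vals))]
        for name, vals in series.items()
    }
    total_weight = sum(weights[name] for name in series)
    n_years = len(next(iter(series.values())))
    return [
        sum(weights[name] * pct_of_trend[name][i] for name in series) / total_weight
        for i in range(n_years)
    ]


# Hypothetical annual output series and economic weights
series = {
    "pig iron": [100, 130, 110, 150, 140],
    "cattle": [50, 52, 51, 55, 54],
}
weights = {"pig iron": 20, "cattle": 4}

index = adjusted_index(series, weights)
print([round(v, 1) for v in index])
```

Values above 100 mark years in which output, on the average of the weighted series, stood above its computed “normal.”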

In the process of averaging deviations from trend, account 
should be taken of the relative variability of the series 
being combined. As an example, we may consider the 
combination of data of pig iron production and cattle 
receipts in a general index of production. Reducing pig 
iron production to terms of “seasonally adjusted deviations 
from trend,” we obtain a series marked by rather extreme 
fluctuations. The standard deviation of this adjusted 
series, for a given period, was 27. For cattle receipts, cor- 
respondingly adjusted, the standard deviation was 11. In 
any combination of the two series of percentage deviations 
the more widely fluctuating pig iron measurements will exer- 
cise a dominant influence, unless correction is made. The use 
of weights defining the relative economic importance of the 
two series will not prevent distortion due to the greater 
variability of the pig iron series. 

One way out of the difficulty is to divide the deviations 
from trend by the respective standard deviations, before 
averaging. This gives an index in standard deviation units. 
Another procedure involves the combination of the “eco- 
nomic weight” and the standard deviation of each series 
in a weighting factor to be applied directly to the percentage 
deviations from trend. The economic weight is divided 
by the corresponding standard deviation, in making the 
combination. The method is illustrated below. 
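
The arithmetic of the combined weighting factor may be sketched in Python. The economic weights (20 and 4) and standard deviations (27 and 11) are those of the pig iron and cattle example; the percentage deviations fed to `combine`, the division by the sum of the factors, and the function names themselves are assumptions made for illustration.

```python
def weighting_factors(economic_weights, standard_deviations):
    """Economic weight divided by standard deviation, series by series."""
    return {
        name: economic_weights[name] / standard_deviations[name]
        for name in economic_weights
    }


def combine(pct_deviations, factors):
    """Weighted average of percentage deviations (assumed normalization)."""
    total = sum(factors.values())
    return sum(factors[name] * pct_deviations[name] for name in factors) / total


economic_weights = {"pig iron production": 20, "cattle receipts": 4}
standard_deviations = {"pig iron production": 27, "cattle receipts": 11}

factors = weighting_factors(economic_weights, standard_deviations)
for name, value in factors.items():
    print(f"{name}: {value:.4f}")

# Hypothetical month in which each series stands one standard deviation above trend
combined = combine({"pig iron production": 27.0, "cattle receipts": 11.0}, factors)
print(round(combined, 2))
```

When each series stands exactly one standard deviation above trend, the two contribute in the ratio of 20 to 4, that is, in proportion to their economic weights alone, which is the purpose of the correction.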

Series                   Economic    Standard     Economic weight
                         weight      deviation    ÷ standard deviation

Pig iron production         20          27               .741
Cattle receipts              4          11               .363

The final weighting factors are the figures given in the last 
column. These may, of course, be rounded off when a 
number of series are to be averaged.^ 

^ This useful method of combining economic weights and standard devia- 

The alternative method of combining economic series is 
simpler. Each unadjusted index possesses a trend which is 
“a composite of the persistent tendencies of the several 
original series upon which the unadjusted index is based.” 
It is possible to measure this trend, instead of the separate 
original trends, and secure the adjusted index directly from 
the unadjusted. Day’s results indicate that there is no 
loss of accuracy in the use of the simpler method. 
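
The simpler method may be sketched as follows: a single trend is fitted to the unadjusted index itself, and each value is then expressed as a percentage deviation from that trend. The index values are hypothetical, the straight-line trend is an assumption, and the function names are our own.

```python
def least_squares_line(values):
    """Straight line fitted by least squares to equally spaced values."""
    n = len(values)
    x_bar = (n - 1) / 2
    y_bar = sum(values) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(values))
    sxx = sum((x - x_bar) ** 2 for x in range(n))
    slope = sxy / sxx
    return [y_bar + slope * (x - x_bar) for x in range(n)]


def percentage_deviations_from_trend(index_values):
    """Adjusted index: percentage deviation of each value from its trend value."""
    trend = least_squares_line(index_values)
    return [100 * (a - t) / t for a, t in zip(index_values, trend)]


# Hypothetical unadjusted index of production, annual values
unadjusted_index = [100.0, 108.0, 97.0, 121.0, 126.0, 118.0, 140.0]

adjusted = percentage_deviations_from_trend(unadjusted_index)
print([round(v, 1) for v in adjusted])
```

Only one trend has to be fitted, whatever the number of constituent series, which is the whole of the saving in labor.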

AN INDEX OF INDUSTRIAL ACTIVITY 

This procedure, with certain modifications, is well exemplified 
in an “Index of Industrial Activity in the United 
States,” constructed by the Chief Statistician’s Division 
of the American Telephone and Telegraph Company.^ The 
elements of this index are monthly data; seasonal corrections 
are therefore necessary. When these corrections have been 
made a general index measuring long-term growth and 
cyclical-accidental fluctuations, in combination, is constructed 
by averaging 11 series, with appropriate weights.² 
This index is shown for the period 1899-1937, with line 
of trend, in Fig. 66. The trend line was fitted by least 
squares to data for the period 1899-1930, with the war 
years, 1917-1918, omitted. 

When each monthly value of the index is expressed as 
a percentage deviation from the corresponding trend value, 
the measurements presented in Table 82, and graphically 
portrayed in Fig. 67, are obtained. The cyclical-accidental 

tions has been employed by G. W. Starr, Director of the Bureau of Business
Research of Indiana University. I am indebted to him for the example. 

^ This index has been constructed for the use of the staffs of the Bell system
companies, and is not available for distribution. It is published here by cour-
tesy of the American Telephone and Telegraph Company.

2 The following series were used for the later years of the period covered:
steel ingot production, pig iron production, automobile passenger car produc-
tion, building contracts awarded (on a square foot basis), cotton consumption,
wool consumption, slaughter of cattle and hogs, newsprint consumption, mis-
cellaneous freight car loadings, electric power consumption, and employment
in manufacturing industries. Since employment is included, the index goes
slightly beyond the field of strict physical production. It is intended to be
an index of industrial activity.
fluctuations in industrial activity, as represented by the
11 series employed, are traced by the movements of this 
index. 

Index of Industrial Production of the Board of
Governors of the Federal Reserve System 

A comprehensive monthly index of production in mining 
and manufacturing industries of the United States is con- 
structed by the Division of Research and Statistics of the 
Board of Governors of the Federal Reserve System. This 
index is designed to serve current needs. In the selection 
of its components emphasis has been placed upon the 
promptness with which basic data on the output of industrial 
commodities become available, as well as upon their accuracy 
and representativeness. 

The chief points of general interest relating to this index 
may be briefly noted. 

Coverage. The index is derived from 60 individual series, 
measuring production in some 35 industries. Approximately 
80 per cent of the total industrial production of the United 
States is represented directly or indirectly in the index. 

Base period. The base of the published relatives is daily
average production during the three years 1923, 1924, and
1925. The final indices appear as relatives on this base,
both with and without seasonal correction.

Character of data used. For each commodity production
is computed in terms of average output per working day. 
By this method distortion due to changes from one month 
to the next in the number of Sundays and holidays included 
is avoided. 

Form of index number. The index is of the weighted
aggregative type. Original quantity figures are multiplied
by weighting factors which convert them into common
units (i.e., values, in dollars). In deriving the final index,
the aggregate for a given date is expressed as a percentage
of the base-period aggregate.
Weighting factors. For mineral products the weight for
each commodity is its average per unit value in the base
period. For manufactured products the weight for a given
commodity is the per unit “value added” (i.e., added by
manufacture), modified to the extent that the commodity
in question is taken to represent other manufactured prod-
ucts not directly included in the index. These “weights”
thus correspond to p’s in the aggregative formula

$$\frac{\Sigma(q_1 p_0)}{\Sigma(q_0 p_0)}$$

except that for a manufactured product the p is a “price”
for the services of agents of fabrication, with a modification
to allow the given commodity to represent similar products
for which quantity data are not available. The weights
for manufactured goods were drawn from the Census of
Manufactures for 1923. The $p_0$ used to weight the q for
manufactures is thus not strictly a base-period price.^

Fig. 68. Physical Volume of Industrial Production in the United States,
1919-1937 (1923-1925 average = 100)

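
The aggregative form may be illustrated by a short sketch. The two commodities, their quantities, and their weights below are hypothetical figures chosen only to show the arithmetic; they are not taken from the Board's index.

```python
def aggregative_index(quantities, base_quantities, base_weights):
    """Weighted aggregative quantity index: 100 * sum(q1 * p0) / sum(q0 * p0)."""
    current_aggregate = sum(q * p for q, p in zip(quantities, base_weights))
    base_aggregate = sum(q * p for q, p in zip(base_quantities, base_weights))
    return 100.0 * current_aggregate / base_aggregate


# Hypothetical data: average daily output of two commodities.
q0 = [100.0, 50.0]   # base-period quantities
q1 = [110.0, 60.0]   # quantities in the given month
p0 = [2.0, 4.0]      # base-period weights (value, or value added, per unit)

index = aggregative_index(q1, q0, p0)   # (220 + 240) / (200 + 200) * 100
```

Because the same weights appear in numerator and denominator, the index moves only with changes in the quantities.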
Adjustment for seasonal variation. No correction for trend 
is made, but in one form of the index an adjustment is 
made to eliminate the effect of seasonal fluctuations in the 


^ Weighting factors were modified for the period 1919-1922 by the com- 
bination of weights for 1919 with those for the base period. 
Table 83

Index of Industrial Production, Board of Governors of the Federal
Reserve System, 1919-1937

(Adjusted for seasonal variation. 1923-1925 average = 100)

Month          1919  1920  1921  1922  1923  1924  1925  1926  1927  1928

Jan.             82    95    67    73    99   100   105   106   107   107
Feb.             79    95    66    76   100   102   104   105   108   109
March            76    93    64    80   103   100   103   106   110   108
April            78    88    64    77   106    95   102   107   108   108
May              78    90    66    81   106    89   102   106   109   108
June             83    91    65    85   106    85   102   108   107   108
July             87    89    65    85   104    84   103   108   106   109
Aug.             89    89    67    83   103    89   103   110   106   110
Sept.            87    86    68    88   100    94   101   111   104   113
Oct.             86    83    71    93    99    95   104   111   102   115
Nov.             85    76    71    97    98    97   107   110   101   117
Dec.             86    72    70   100    97   101   109   107   102   118

Annual index     83    87    67    85   101    95   104   108   106   111

Month          1929  1930  1931  1932  1933  1934  1935  1936  1937

Jan.            119   106    83    72    65    78    90    97   114
Feb.            118   107    86    69    63    81    89    94   116
March           118   103    87    67    59    84    88    93   118
April           121   104    88    63    66    85    86   101   118
May             122   102    87    60    78    86    85   101   118
June            125    98    83    59    91    83    87   104   114
July            124    93    82    58   100    76    86   108   114
Aug.            121    90    78    60    91    73    88   108   117
Sept.           121    90    76    66    84    71    91   109   111
Oct.            118    88    73    67    76    73    95   110   102
Nov.            110    86    73    65    72    74    96   114    88
Dec.            103    84    74    66    75    86   101   121    84

Annual index    119    96    81    64    76    79    90   105   110


output of individual commodities. Seasonal indices were 
computed by averaging the ratios of actual data to twelve- 
month moving averages. (See Chapter VIII.) Where there 
was evidence of progressive change in the seasonal pattern, 
the seasonal adjustments for a given commodity were 
modified from year to year. The actual adjustment for 
seasonal change is made by dividing the daily average
output of a given commodity in a stated month by the
seasonal index for that month, expressed as a ratio (i.e.,
as 1.10, if the conventional index were 110). The seasonally
adjusted q would thus be reduced if the seasonal index 
were above 1.00, raised if the seasonal index were below 
1.00. In the construction of the seasonally corrected 
index, these adjusted q’s are used in the aggregative formula 
previously described.^ 
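
The adjustment just described amounts to a single division. A minimal sketch follows; the function name and the output figures are hypothetical, chosen only to show the direction of the correction.

```python
def seasonally_adjust(daily_average_output, seasonal_index):
    """Divide output by the seasonal index expressed as a ratio.

    `seasonal_index` is given in the conventional form (e.g. 110),
    and is converted to a ratio (1.10) before dividing.
    """
    ratio = seasonal_index / 100.0
    return daily_average_output / ratio


# A month with a seasonal index above 100 is adjusted downward ...
high_season = seasonally_adjust(220.0, 110)
# ... and a month with a seasonal index below 100 is adjusted upward.
low_season = seasonally_adjust(180.0, 90)
```

The adjusted quantities would then take the place of the original q's in the aggregative formula.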

Monthly values of this index are given in Table 83, for 
the period 1919-1937. The index is shown graphically in 
Fig. 68 on page 317. 

Derived Indices of Production and Productivity

It is possible, where suitable records of value of product 
and indices of price changes are available, to derive an 
index of production by indirection. In the case of a single 
commodity it is obvious that pq ÷ p = q. (Here q repre-
sents the number of physical units produced, p represents
average per unit price, and pq is the aggregate value.) 
A similar process is possible in handling statistics relating 
to a number of commodities, in combination. Indeed, the 
records may be in the form of relatives, or index numbers, 
covering a number of months or years. Division of a value 
index by a price index relating to the commodities included 
in the value index will yield an index measuring changes 
in physical output. 
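
In symbols, and as a sketch only (the letters $V$, $P$, and $Q$ are introduced here for illustration and are not the notation of the text), the derivation for index numbers on a common base may be written

```latex
% V_t : index of aggregate value (base = 100)
% P_t : index of prices for the same commodities (base = 100)
% Q_t : derived index of physical output (base = 100)
Q_t = \frac{V_t}{P_t} \times 100
```

which reduces, for a single commodity, to the identity $pq \div p = q$.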

This procedure may sometimes be used to obtain meas- 
urements that could not possibly be built up by combining 
a number of individual records. Whether the method is 
applicable in a given instance depends upon the compara- 
bility of the price and value index numbers. The strict 

^ A detailed description of the constituents of this index and of the pro-
cedure employed in its construction is given in the Federal Reserve Bulletin
for February, 1927. Revisions are noted in the issues of that Bulletin for
March, 1932, Sept., 1933, Nov., 1936, and March, 1937. The index appears
in current issues of the Federal Reserve Bulletin.




commodities included in the value index cannot generally
be met. If we assume that a given price index is fairly
representative of the commodities covered by the value
records, and if the formula employed in the construction 
of the index is appropriate, the method may be justified 
as a means of approximating the required index of physical 
output. 

An example of such a procedure is furnished by the 
materials in Table 84. These illustrate a method used in 
deriving an index of production of manufactured goods. 
The indices in col. (3) are derived directly from the aggre- 
gate figures on “ value added by manufacture.” The indices 
in col. (4) measure changes in average “value added” per 


Table 84 ^

(1)      (2)             (3)          (4)                     (5)
         Total value added,           Index of value added    Derived index
         all census industries        per unit of product,    of physical
Year     (in millions    (in rela-    industries included     volume of
         of dollars)     tives)       in sample               production

1923       25,850          100.0          100.0
1925       26,778          103.6           97.3
1927       27,585          106.7           92.4
1929       31,844          123.2           96.8

^ This table is taken from Economic Tendencies, New York, National Bureau
of Economic Research, 308.


relating to all industries, is derived by dividing the relatives 
for total “value added” by the index numbers measuring 
changes in “value added” per unit of product (with a 
suitable shift in the decimal point). 
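
The division may be carried through for the figures of cols. (3) and (4) of Table 84. The results below are computed here from those two columns, as a sketch of the arithmetic; they are not quoted from the table, and the variable names are illustrative.

```python
years = [1923, 1925, 1927, 1929]

# Col. (3): relatives of total "value added", all census industries.
value_added_relatives = [100.0, 103.6, 106.7, 123.2]

# Col. (4): index of "value added" per unit of product, sample industries.
value_added_per_unit = [100.0, 97.3, 92.4, 96.8]

# Derived index of physical volume: col. (3) divided by col. (4),
# multiplied by 100 (the "shift in the decimal point").
derived_volume = [
    round(100.0 * total / per_unit, 1)
    for total, per_unit in zip(value_added_relatives, value_added_per_unit)
]

for year, volume in zip(years, derived_volume):
    print(year, volume)
```

The reliability of such figures rests on how well the sample behind col. (4) represents all the industries covered by col. (3).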

The derived measurements given in col. (5) of Table 84 
are probably more accurate than index numbers based upon 
directly enumerated physical products. For the gaps in 
the coverage of the latter are serious. Limitations of coverage 
are the more serious in that the excluded industries are 
in many cases just the new, rapidly developing industries 
the output of which is growing most rapidly. 

A somewhat similar process of derivation is employed in 
the construction of measurements of industrial productivity. 
It is impossible, by direct observation, to compile records 
of output per man or per man-hour over any considerable 
area of industrial activity. However, given accurate indices 
of physical production and comparable records of number 
of workers employed or of man-hours worked, one may 
derive index numbers measuring changes in productivity. 

An example of this procedure is given in Table 85. The 
measurements given should be regarded as approximations 
only. 

Table 85

Index Numbers of Physical Volume of Production, Man-Hours
Worked and Output per Man-Hour, Manufacturing
Industries of the United States, 1929-1935

          Physical volume      Total number       Estimated
Year      of manufacturing     of man-hours       output per
          production           worked             man-hour

1929           100                 100               100
1931            75                  66               114
1933            70                  60               117
1935            87                  70               124


Between 1929 and 1935 the total volume of manufacturing
production declined 13 per cent. The number of man-
hours worked decreased by 30 per cent, however. The
indicated gain in output per man-hour was 24 per cent.
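
The last column of Table 85 can be reproduced by dividing the production index by the man-hours index. The sketch below uses the tabulated figures; the variable names are illustrative only.

```python
years = [1929, 1931, 1933, 1935]
production = [100, 75, 70, 87]   # physical volume of manufacturing production
man_hours = [100, 66, 60, 70]    # total number of man-hours worked

# Output per man-hour: production index divided by man-hours index,
# expressed on the base 1929 = 100 and rounded to the nearest unit.
output_per_man_hour = [round(100 * p / h) for p, h in zip(production, man_hours)]

# Percentage changes between 1929 and 1935.
production_decline = production[0] - production[-1]
man_hours_decline = man_hours[0] - man_hours[-1]
productivity_gain = output_per_man_hour[-1] - output_per_man_hour[0]
```

A small error in either index is carried directly into the quotient, which is one reason the derived figures should be read as approximations.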

Measurements such as these are of unquestioned value 
to the student of industrial change, but their limitations 
should be clearly stated. The initial necessity of full com- 
parability between the output and employment records has 
been mentioned. Discrepancies here may lead to serious 
errors in the derived measurements. More difficult to 
detect are technical industrial changes that do not appear 
in the statistical records. Changes in the quality of the 
goods represented in the production index may lessen the 
accuracy of that index, and affect the productivity measure- 
ments. If employment is measured in terms of number of 
men employed, the resulting index of per capita output 
may be seriously distorted by changes in the length of the 
working week. Again, if only direct labor is enumerated 
in the employment index, a shift in technical methods that 
involves the use of a much larger proportion of indirect 
labor may lead to a great advance in apparent productivity, 
which far exceeds the real gain. Some of the gain that 
apparently follows the increased mechanization of a plant 
or a process is of this fictitious sort. Labor that precedes 
the direct act of production, and servicing and supervising 
labor, may have replaced direct labor. Failure to take 
account of the contributions of these indirect applications 
of labor may lead to grossly exaggerated measures of 
productivity gains. 

The purpose of the preceding pages has been to exemplify 
procedures used in the measurement of changes in produc-
tion, with incidental reference to related problems. While
there is no one standard method, it will be clear that the
construction of quantity index numbers requires no involved
procedure. Certain special problems (of weighting, of
measuring secular and seasonal movements, of ensuring
comparability when methods of derivation are employed)
have been noted. In addition, most of the problems that
bulk large in the construction of price index numbers are
faced in this area also. The task of obtaining accurate,
homogeneous series of basic data entails no less careful
field work in production than in prices. Quality changes
lessen the accuracy of both types of index numbers. Com-
parisons over considerable time periods are rendered inaccu-
rate by such quality changes and perhaps even more by
changes in “regimen,” that is, in the complex of tastes, consuming
habits, and technical methods that determines the weighting 
factors used in the construction of index numbers. In spite 
of these difficulties substantial progress has been made in 
recent years in the improvement of measures of industrial 
activity of the type discussed in this chapter. More com- 
prehensive and more accurate data are being compiled, 
and technical standards in the construction of index numbers 
are being steadily raised. These gains are contributing to 
a notable advance in our knowledge of economic processes. 
CHAPTER X


THE MEASUREMENT OF RELATIONSHIP: LINEAR 
CORRELATION 

In discussing averages and measures of dispersion and 
skewness we have been dealing with methods of describing 
a single frequency distribution. The arrangement of the 
values of a single variable along a scale may be portrayed 
by means of these measures, which enable the central value 
to be determined and the character of the distribution 
about that central value to be described. In the analysis 
of time series a somewhat different problem has been faced. 
In such cases we are concerned with the changing values 
of a variable factor with the passage of time, and seek to 
determine the degree to which the changes in value are 
due to the play of different forces — the secular trend and 
cyclical, seasonal, and accidental factors. The preceding 
chapters dealt with methods by which we might measure 
the effect upon a given series of each of these factors (with 
the exception of accidental fluctuations). 

Certain of these methods are applicable to the problem 
now before us. It was found that in dealing with time 
series the relationship between time and the long term trend 
factor may be described by a definite mathematical equa- 
tion. That is, trend or growth seems to be a function of 
time for many economic series. Where such a relationship 
prevails, whether it hold precisely or only approximately, 
there is a distinct advantage in securing a mathematical 
expression which describes it. A similar but much broader 
problem is now to be discussed. If it is possible in dealing 
with time series to secure a definite mathematical equation 
for the relation between time and the normal values of the
items in a given series, cannot the same device be employed
in studying the relationship between other variables? Can 
we not define, mathematically, the relation between cotton 
production and the price of cotton, between corn yield 
and rainfall, between earnings and the output of labor? 
If this can be done, it will place in the hands of the econo- 
mist a very powerful tool, giving his methods something 
of the precision which attaches to the work of the physical 
scientist. 

The Relation between Number of Taxable Personal 
Incomes and Motor Vehicle Registration 

As a typical problem we may consider the relation between 
the number of taxable personal incomes and the number 
of passenger automobiles registered, by states in 1934. 
The figures are given in columns (2) and (3) of Table 86.^ 

These figures are plotted in Fig. 69, each dot representing 
the relation between the number of taxable incomes and 
the number of registered passenger cars for a given state. 
Such a figure is termed a “scatter diagram.” It is clear 
from this diagram that there is a relationship between the 
two variables. In general, the states with a large number 
of taxable personal incomes are also those having a large 
number of motor vehicle registrations. The relationship, 
however, is not perfect. Two states with the same number 
of taxable incomes may differ quite widely in the number 
of registered vehicles. Thus both Rhode Island and Colorado 

^ Nine states for each of which there were more than 100,000 individual 
income tax returns and more than 685,000 passenger cars registered in 1934 
have not been included. The observations for these states, some of which are 
materially affected by the presence of important industrial centers, depart 
rather widely from those for the remaining states, and are marked by a func-
tional relationship between personal incomes and motor vehicle ownership
somewhat different from that prevailing for the country at large. The states
thus excluded are New York, Pennsylvania, New Jersey, Illinois, Massachu-
setts, Michigan, Texas, Ohio, and California. The state of Washington has
also been excluded, since the income tax returns for that state are combined
with those of Alaska, in the reports of the Bureau of Internal Revenue. The
results are to be interpreted, of course, with these restrictions in mind.



Table 86

Taxable Personal Incomes and Passenger Automobile Registration in
Thirty-Eight States, 1934

(1)               (2)             (3)             (4)          (5)          (6)
                  No. of taxable  No. of passen-
                  personal in-    ger cars reg-
State             comes, 1934     istered, 1934
                  (thousands)     (thousands)
                       X               Y              XY          X²           Y²

Alabama               23             192            4,416         529       36,864
Arizona               11              80              880         121        6,400
Arkansas              13             162            2,106         169       26,244
Colorado              31             246            7,626         961       60,516
Connecticut           91             310           28,210       8,281       96,100
Delaware              11              45              495         121        2,025
Florida               33             280            9,240       1,089       78,400
Georgia               38             317           12,046       1,444      100,489
Idaho                  9              91              819          81        8,281
Indiana               70             680           47,600       4,900      462,400
Iowa                  48             591           28,368       2,304      349,281
Kansas                36             453           16,308       1,296      205,209
Kentucky              35             295           10,325       1,225       87,025
Louisiana             37             199            7,363       1,369       39,601
Maine                 21             141            2,961         441       19,881
Maryland              84             288           24,192       7,056       82,944
Minnesota             67             594           39,798       4,489      352,836
Mississippi           13             141            1,833         169       19,881
Missouri              98             632           61,936       9,604      399,424
Montana               17              97            1,649         289        9,409
Nebraska              27             350            9,450         729      122,500
Nevada                 5              26              130          25          676
New Hampshire         17              91            1,547         289        8,281
New Mexico             8              67              536          64        4,489
N. Carolina           32             385           12,320       1,024      148,225
N. Dakota             10             130            1,300         100       16,900
Oklahoma              39             403           15,717       1,521      162,409
Oregon                27             233            6,291         729       54,289
Rhode Island          31             124            3,844         961       15,376
S. Carolina           15             182            2,730         225       33,124
S. Dakota              8             146            1,168          64       21,316
Tennessee             38             299           11,362       1,444       89,401
Utah                  11              85              935         121        7,225
Vermont               10              69              690         100        4,761
Virginia              48             317           15,216       2,304      100,489
W. Virginia           30             167            5,010         900       27,889
Wisconsin             93             589           54,777       8,649      346,921
Wyoming                7              52              364          49        2,704

Totals             1,242           9,549          451,558      65,236    3,610,185

had 31,000 taxable personal incomes in 1934, yet the former
had 124,000 passenger cars registered, while the latter had
246,000. Were the relationship perfect a single and unchang-
ing value of the Y-variable would always be found paired
with a given value of the X-variable.

Our first problem is the derivation of an equation to
describe this relationship which, while not perfect, is clearly
existent. There is here a relationship analogous to a trend,
and it is apparently a trend which can be represented by
a straight line. The equation to a straight line, fitted by
the method of least squares to the points on the scatter
diagram, will express mathematically the average relationship
between these two variables. Such a line could, of course,
be fitted by inspection, but a more accurate result will be
obtained by the method of least squares.

Fig. 69. Scatter Diagram Showing the Relation between Taxable
Personal Incomes and Passenger Car Registration, by States, in 1934,
with Line of Average Relationship (horizontal axis: Taxable Personal
Incomes in 1934, thousands)

This calls for the solution of the following normal equa-
tions:

$$\Sigma(Y) = Na + b\,\Sigma(X)$$

$$\Sigma(XY) = a\,\Sigma(X) + b\,\Sigma(X^2).$$

The values required for the solution of these equations may
be derived from the data as arranged in Table 86. Sub-
stituting, we have

9,549 = 38a + 1,242b
451,558 = 1,242a + 65,236b.

Solving

a = 66.321
b = 5.659.

The required equation is

Y = 66.321 + 5.659X.^

This line is plotted in Fig. 69. 
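
The solution of the two normal equations may be checked with a short sketch that works directly from the totals of Table 86. The function name is illustrative; the closed-form expressions are the usual algebraic solution of the pair of equations for a and b.

```python
def fit_line(n, sum_x, sum_y, sum_xy, sum_x2):
    """Solve the normal equations
         sum_y  = n * a     + b * sum_x
         sum_xy = a * sum_x + b * sum_x2
    for the constants a and b of the line Y = a + bX.
    """
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


# Totals from Table 86 (38 states; both variables in thousands).
a, b = fit_line(n=38, sum_x=1242, sum_y=9549, sum_xy=451558, sum_x2=65236)
print(round(a, 3), round(b, 3))
```

Only the column totals enter the computation, which is why the table carries the products XY and the squares X².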

A mathematical expression has now been secured for 
the relation between the two variables being studied, the 
number of taxable personal incomes, by states, and the 
number of passenger automobiles registered. The former 
is the independent or X-variable in the equation, the latter 
the dependent or Y-variable. This equation constitutes a
measure of the functional relationship between these two
variables, but it is only an expression of average relationship.
How significant is the equation? If the relationship were 
perfect, and the plotted points all lay on the line describing 
this relationship, the equation could be used with confidence 
as an accurate instrument for determining the value of one 
variable from a value of the other. But a line with a definite 
equation may be fitted to points which depart very widely 
from it, which are widely dispersed. In such a case the 
equation may have the appearance of describing a precise 
relationship but the variation is so great that it cannot be 
used with confidence. It is the same problem as that which 
arises when an average is employed. We must know how 
significant the average is, how great the concentration about 
it, before we may use it intelligently. So the equation of 

^ In the chapters on correlation capital letters (W, X, Y, etc.) are used to
represent original values of the variable quantities, as measured from the zero
points on the scales of actual values. Capital letters with prime marks are
used to measure deviations from arbitrary origins, X' and Y' for such devia-
tions in class-interval units, X" and Y" for such deviations in original units
of measurement. Small letters (w, x, y, etc.) are used to represent values of the
variables expressed as deviations from their respective arithmetic means.
relationship between variables means little unless we know
to what extent it holds in practical experience. We must 
have a measure of the dispersion about the line we have 
fitted. 

In describing the frequency distribution, the standard 
deviation is used as the best general measure of variation. 
It is, obviously, the measure we need in determining the 
reliability of the equation of average relationship. The 
standard deviation about this line will not only serve as a 
general index of the significance of this equation but will 
enable us to measure the degree of accuracy of estimates 
based upon the equation. 

THE COMPUTATION OF THE STANDARD ERROR
OF ESTIMATE

The standard deviation about a line of average relation- 
ship, being a measure of the accuracy of estimates, may 
be termed the standard error of estimate. The term standard
deviation is generally confined to the root-mean-square
deviation about the arithmetic mean. The standard error 
of estimate is represented by the symbol S. 

In the computation of S we must know the computed value 
of Y which corresponds to each given value of X. By 
substituting the given values of X in the equation 

Y = 66.321 + 5.659X

normal Y values may be computed. The deviations of the
actual Y values from the computed may then be determined.
The root-mean-square of these deviations is the required
measure. A method of computation is illustrated in Table 87.
From this table we have

$$S_y = \sqrt{\frac{421{,}250.91}{38}} = 105.3 \text{ (thousand) motor cars.}$$

(The symbol $S_y$ is used, as this is the standard error of estimate of the
Y-variable.)
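
The same computation may be sketched in code, using the X and Y values of Table 86 and the equation fitted above. The variable names are illustrative. Because the tabulated computed values are rounded to one decimal, the sum of squared deviations obtained here need not agree exactly with the tabulated total, though the standard error agrees to the first decimal.

```python
from math import sqrt

# Taxable personal incomes (X) and passenger cars registered (Y), in
# thousands, for the 38 states of Table 86, in the order of the table.
X = [23, 11, 13, 31, 91, 11, 33, 38, 9, 70, 48, 36, 35, 37, 21, 84, 67, 13, 98,
     17, 27, 5, 17, 8, 32, 10, 39, 27, 31, 15, 8, 38, 11, 10, 48, 30, 93, 7]
Y = [192, 80, 162, 246, 310, 45, 280, 317, 91, 680, 591, 453, 295, 199, 141, 288,
     594, 141, 632, 97, 350, 26, 91, 67, 385, 130, 403, 233, 124, 182, 146, 299,
     85, 69, 317, 167, 589, 52]

a, b = 66.321, 5.659  # constants of the fitted line Y = a + bX

# Deviation of each actual Y from the value computed from the equation.
deviations = [y - (a + b * x) for x, y in zip(X, Y)]

# Standard error of estimate: root-mean-square of the deviations.
sum_of_squares = sum(d * d for d in deviations)
S_y = sqrt(sum_of_squares / len(Y))
print(round(S_y, 1))
```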


Table 87

Computation of Standard Error of Estimate

(1)               (2)               (3)          (4)           (5)
                  No. of passenger
                  cars registered,
State             1934                            (2) - (3)
                  (in thousands)
                  Y actual          Y computed       d            d²

Alabama               192             196.5       -   4.5         20.25
Arizona                80             128.6       -  48.6      2,361.96
Arkansas              162             139.9       +  22.1        488.41
Colorado              246             241.8       +   4.2         17.64
Connecticut           310             581.3       - 271.3     73,603.69
Delaware               45             128.6       -  83.6      6,988.96
Florida               280             253.1       +  26.9        723.61
Georgia               317             281.4       +  35.6      1,267.36
Idaho                  91             117.3       -  26.3        691.69
Indiana               680             462.4       + 217.6     47,349.76
Iowa                  591             337.9       + 253.1     64,059.61
Kansas                453             270.0       + 183.0     33,489.00
Kentucky              295             264.4       +  30.6        936.36
Louisiana             199             275.7       -  76.3      5,821.69
Maine                 141             185.2       -  44.2      1,953.64
Maryland              288             541.7       - 253.7     64,262.25
Minnesota             594             445.5       + 148.5     22,052.25
Mississippi           141             139.9       +   1.1          1.21
Missouri              632             620.9       +  11.1        123.21
Montana                97             162.5       -  65.5      4,290.25
Nebraska              350             219.1       + 130.9     17,134.81
Nevada                 26              94.6       -  68.6      4,705.96
New Hampshire          91             162.5       -  71.5      5,112.25
New Mexico             67             111.6       -  44.6      1,989.16
N. Carolina           385             247.4       + 137.6     18,933.76
N. Dakota             130             122.9       +   7.1         50.41
Oklahoma              403             287.0       + 116.0     13,456.00
Oregon                233             219.1       +  13.9        193.21
Rhode Island          124             241.8       - 117.8     13,876.84
S. Carolina           182             151.2       +  30.8        948.64
S. Dakota             146             111.6       +  34.4      1,183.36
Tennessee             299             281.4       +  17.6        309.76
Utah                   85             128.6       -  42.6      1,814.76
Vermont                69             122.9       -  53.9      2,905.21
Virginia              317             338.0       -  21.0        441.00
W. Virginia           167             236.1       -  69.1      4,774.81
Wisconsin             589             592.6       -   3.6         12.96
Wyoming                52             105.9       -  53.9      2,905.21

Total                                                        421,250.91

The measure $S_y$ is to be interpreted in precisely the same
way as the standard deviation about an arithmetic mean.
Given an approximately normal distribution of items about
the line of relationship, 68 per cent of all the cases will lie
within a range of ±S (in this case 105.3), 95 per cent
will fall within ±2S (in this case 210.6) and 99.7 per cent will
fall within ±3S (in this case 315.9). If there were no
scatter about the line fitted to the points representing the
corresponding values of X and Y, S would have a value
of zero, and the value of Y could be estimated from the
value of X with perfect accuracy. The less the dispersion
about the line, the smaller the value of S. The value of
S serves, therefore, as an indicator of the significance and
usefulness of the line which describes the relationship
between the two variables. The standard error, it should
be noted, is expressed in the same units as the original
Y-values.

THE MAKING OF ESTIMATES

We may, for a moment, consider the significance of these 
results. Let us assume that, not knowing the number of
motor vehicles registered in a given state, we are under the
necessity of estimating it. Two methods are open to us.
We may, in the first place, base the estimate upon our
knowledge of the Y-variable alone. The total number of
passenger automobiles in the 38 states included in the
study is 9,549,000. Dividing this by 38 we have 251,289 
as the average. With no specific information as to the 
registration in a given state, the arithmetic mean of all 
the state figures would be taken as the most probable value 
for the state in question. (The most probable value of a 
series of observations is the mean of the series.) How may 
we judge of the accuracy of this estimate? The standard 
deviation of the original distribution is a measure of the
degree of variation about the mean and, therefore, a measure
of the accuracy of an estimate based upon the mean. If
the distribution approximates the normal type, the chances
are 68 out of 100 that the true value for the state in question 
will not differ from the mean by more than the standard 
deviation. The standard deviation of passenger automobile 
registration by states, as recorded in Table 86, is 178.5. 
The mean affords, therefore, a basis for a reasonable estimate, 
and the standard deviation affords some indication of the 
probabilities involved in making this estimate. 

Another method of estimating the motor vehicle registra- 
tion in a given state is open to us if we know the number 
of taxable personal incomes in that state. We know, as a 
result of the study described in the preceding pages, that 
the average relationship between passenger car registration 
and number of taxable personal incomes is described by the 
equation 

Y = 66.321 + 5.659X.

(The unit is 1,000 for each variable, it will be recalled.) 

If a state has 50,000 taxable personal incomes, it may be 
estimated from this equation that there are 349,271 passenger 
automobiles registered in that state. This is the most prob- 
able value of Y as determined from the equation of average 
relationship. Is this estimate any better than the previous 
one, which took the mean Y as the most probable value? 
Does our knowledge of the average relationship between 
X and Y aid us in estimating the value of Y from a known 
value of X? 

The answers to these questions are given by the standard 
error of estimate, and by the relationship between the 
standard error of estimate and the standard deviation of Y. 
The standard error of estimate (that is, the standard devia- 
tion about the line of average relationship) is 105.3. The 
standard deviation of Y is 178.5. Clearly the estimate 
made from the equation is more accurate than the estimate 
based upon the value of the mean Y. In the former case 
the odds are 68 out of 100 that the error will not exceed 
105.3 or, in terms of the original units, 105,300 vehicles. 
When the estimate is made from the mean, the odds are 
68 out of 100 that the error will not exceed 178,500 vehicles.^ 
From our knowledge of the relationship between the two 
variables, even though that relationship is by no means 
constant or perfect, we are able to reduce materially the 
errors of estimate. 
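
The comparison just made can be checked numerically. The following is a minimal sketch that assumes only the figures quoted in the text; the function and variable names are illustrative and do not come from the original.

```python
# Figures quoted in the text (unit: 1,000 for each variable).
A, B = 66.321, 5.659      # constants of the equation Y = a + bX
MEAN_Y = 9549 / 38        # mean registration per state
SIGMA_Y = 178.5           # standard deviation of Y
S_Y = 105.3               # standard error of estimate


def estimate_from_equation(x):
    """Estimate Y (thousands of cars) from X (thousands of taxable incomes)."""
    return A + B * x


def band(center, spread):
    """Interval expected to hold about 68 per cent of cases, assuming normality."""
    return center - spread, center + spread


y_hat = estimate_from_equation(50)
print(round(y_hat, 3))          # estimate from the equation
print(band(MEAN_Y, SIGMA_Y))    # band about the mean
print(band(y_hat, S_Y))         # narrower band about the fitted line
```

The second band is narrower than the first, which is the sense in which the equation improves upon the mean as a basis for estimate.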

THE COEFFICIENT OF CORRELATION 

We have now secured two measures which aid us in 
describing the relationship between variable quantities. 
The first is the fundamental equation of relationship, the 
expression of the degree of change in one variable associated, 
on the average, with a given change in the other. The second 
is the standard error of estimate, the measure of the degree 
of “scatter” about the line of average relationship. The 
standard error resembles the standard deviation in that 
it is a measure expressed in absolute terms, in the units 
employed in measuring the original Y-values. This measure 
enables us to determine in a given case the probability that 
an estimate based upon the equation of relationship will 
fall within certain limits. 

In measuring variation it has been found that an abstract 
measure of variability is needed, one which is divorced 
from the absolute terms of the given problem. Such a 
measure is particularly needed, it was noted, when different 
distributions are to be compared. So, for measuring the 
degree of variability, a coefficient of variation is employed. 
There is need of a somewhat similar measure in connection 
with our present problem. We need a measure of the 
degree of relationship between two variables, an abstract 
coefficient which is divorced from the particular units 
employed in a given case. Karl Pearson has developed such 
a coefficient. 

^ In the present case, with a limited number of items and distributions which 
depart somewhat from the normal type, the precise probabilities cannot be so 
accurately determined from the values of $S_y$ and $\sigma_y$. With this qualification 
in the matter of interpretation we may use $S_y$ and $\sigma_y$ as useful measures of 
dispersion. 

This measure may be explained in terms of the preceding 
discussion. It was found that the usefulness of estimates 
based upon the equation of relationship could be determined 
by comparing the standard error of Y (the measure of 
scatter about the line of relationship) with the standard 
deviation of Y. If the standard error be as great as the 
standard deviation the equation of relationship is of no 
use to us, but if the standard error be less than the standard 
deviation the accuracy of estimates may be improved by 
using this equation. The significance of the equation is thus 
indicated by the relation between the standard error and 
the standard deviation. But these are both in absolute 
terms, so that by dividing one by the other an abstract 
measure may be secured. Thus we might write 

$$\text{Measure of correlation} = \frac{S_y}{\sigma_y}.$$

A somewhat more useful measure is secured by putting the 
ratio in this form: 

$$\text{Measure of correlation} = \sqrt{1 - \frac{S_y^2}{\sigma_y^2}}.$$


This measure, when used in connection with a linear equa- 
tion, is called the coefficient of correlation and is represented 
by the symbol r. 

A brief consideration of this formula will help to make 
clear the significance of r. If there is no dispersion about 
the line of relationship, $S_y$ will have a value of zero; the 
equation describes a perfect relationship between the two 
variables. In this case, as is clear from the formula, r must 
have a value of 1. 

The maximum value of $S_y$ is one which is equal to $\sigma_y$. 
Under these conditions, when the equation of relationship 
is of no aid in improving our estimates, the formula will 
give zero as the value of r. Such a value indicates that 
there is no relationship between the two variables; in other 
words, that the straight line of best fit is horizontal, passing 
through the mean of the Y's. It shows that there is no 
tendency for the high values of Y to be associated with 
high values of X or for high values of Y to be associated 
with low values of X. The two variables fluctuate in absolute 
independence. In such a case the deviation of each point 
from the fitted line is equal to its deviation from the mean, 
and the two root-mean-square deviations are equal, as 
stated. 

Zero and unity are thus the limits to the value of r. 
The values found in practical work fall somewhere between 
these limits, approaching unity in cases where the degree 
of relationship is high. The greater the value of r, the 
greater the confidence that may be placed in the equation 
as an expression of a relation which is approximated in a 
high percentage of cases. In the example presented above, 
dealing with motor vehicle registration and number of 
taxable personal incomes, we have 

$$r = \sqrt{1 - \frac{(105.3)^2}{(178.5)^2}} = .81.$$


This value indicates a definite and fairly close connection 
between these two variables for the states included in the 
sample. 
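
The formula and its two limiting cases can be sketched in a few lines. This is an illustration only; the function name is an assumption, and the sign of r is not determined by this formula.

```python
import math


def correlation_from_errors(s_y, sigma_y):
    """Magnitude of r from r = sqrt(1 - S_y^2 / sigma_y^2)."""
    return math.sqrt(1 - (s_y / sigma_y) ** 2)


# Figures from the motor vehicle example.
r = correlation_from_errors(105.3, 178.5)
print(round(r, 2))

# Limiting cases discussed in the text.
print(correlation_from_errors(0.0, 178.5))     # no scatter about the line
print(correlation_from_errors(178.5, 178.5))   # equation gives no improvement
```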

The coefficient of correlation may be made somewhat 
more significant by giving it the sign of the constant b in 
the equation of relationship. This sign indicates whether 
the slope of the line is positive or negative and, when 
attached to r, enables us to tell whether the relationship 
is direct or inverse. Thus in the present case high values 
of one variable are paired with high values of the other. 
The correlation is positive and the coefficient should be 
written +.81. When cotton production and prices are 
correlated the relationship is an inverse one, for high values 
of one variable are generally associated with low values of 
the other. 

The measurement of relationship in a given case is com- 
pleted when we have secured the three measures described. 
The equation of average relationship is an expression of 
the underlying law connecting the two variables, if such a 
law may be assumed. The standard error of estimate meas- 
ures the variation, in absolute terms, about the line of rela- 
tionship. The coefficient of correlation is an abstract measure 
of the degree to which the average relationship actually holds 
in practice. 

DETAILS OF CALCULATION 

In the preceding section the attempt has been made to 
explain the various measures necessary in studying the 
relationship between variable quantities without introducing 
a detailed explanation of procedure. We may now return 
to a consideration of the details of calculation, including 
certain methods by which this calculation may be reduced 
to a minimum. 

The procedure followed in the preceding illustration is a 
logical one to employ in deriving the three required values. 
This method is capable of general application, but the labor 
involved may be materially reduced by taking advantage 
of a short-cut method of deriving $S_y$. This method may be 
first explained with reference to data of the type dealt 
with above. And, for the present, the discussion will be 
confined to cases in which the relationship between variables 
may be described by a straight line. 

The first problem is the derivation of the equation of 
relationship. A line of the type 

Y = a + bX 

is fitted by the method of least squares. 

The next step is the computation of the square of the 
standard error of estimate. This was done in the above illus- 
tration by measuring the deviation of each individual obser- 
vation from the fitted line, and getting the mean-square of 
these deviations. It may be shown* that this value can be 
derived from the following equation: 

$$S_y^2 = \frac{\Sigma(Y^2) - a\Sigma(Y) - b\Sigma(XY)}{N}.$$

The quantities a and b are the constants in the equation to 
the fitted straight line. The other values relate to the 
original observations. Substituting in this equation a and b 
and the other necessary values, taken from Table 86, we have† 

$$S_y^2 = \frac{3{,}610{,}185 - (66.32136 \times 9{,}549) - (5.65925 \times 451{,}558)}{38} = 11{,}090$$

$$S_y = 105.3.$$

* The standard error of estimate is computed from the formula 

$$S_y = \sqrt{\frac{\Sigma(d^2)}{N}}$$

where d represents a single deviation from the fitted line, or the difference 
between the actual and the computed value of Y in a given case. The latter 
is derived from the equation 

$$Y_c = a + bX.$$

(The symbol $Y_c$ is used to represent the computed value of Y.) 

If we let Y represent the actual value, we have, for each residual, 

$$d = Y_c - Y$$

or 

$$d = a + bX - Y. \tag{1}$$

There will be as many equations of this type as there are points. Multiply- 
ing each one by d, and adding, we have 

$$\Sigma(d^2) = a\Sigma(d) + b\Sigma(dX) - \Sigma(dY). \tag{2}$$

But, since the line was fitted by the method of least squares, 

$$\Sigma(d) = 0, \qquad \Sigma(dX) = 0$$

(for proof of this see Appendix A) 
and, therefore, 

$$\Sigma(d^2) = -\Sigma(dY). \tag{3}$$

Returning again to equation (1), we may multiply throughout by Y, and 
add, securing 

$$\Sigma(dY) = a\Sigma(Y) + b\Sigma(XY) - \Sigma(Y^2). \tag{4}$$

Substituting the equivalent of $\Sigma(dY)$ in equation (3), we have 

$$\Sigma(d^2) = \Sigma(Y^2) - a\Sigma(Y) - b\Sigma(XY) \tag{5}$$

from which the given formula for $S_y^2$ is derived. 

† a and b are given to a greater number of 
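
The short-cut formula for the square of the standard error lends itself to a brief sketch. The function name and argument order below are assumptions made for illustration; the numerical inputs are the sums and constants quoted in the text.

```python
import math


def standard_error_of_estimate(sum_y2, sum_y, sum_xy, a, b, n):
    """Short-cut formula: S_y^2 = (sum(Y^2) - a*sum(Y) - b*sum(XY)) / N."""
    s_squared = (sum_y2 - a * sum_y - b * sum_xy) / n
    return s_squared, math.sqrt(s_squared)


s2, s = standard_error_of_estimate(
    sum_y2=3_610_185, sum_y=9_549, sum_xy=451_558,
    a=66.32136, b=5.65925, n=38,
)
print(round(s2), round(s, 1))
```

No individual deviation from the fitted line is needed; only the sums already gathered in fitting the line enter the calculation.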

From this point the procedure may follow that already 
described, r being computed from the formula 

$$r = \sqrt{1 - \frac{S_y^2}{\sigma_y^2}}.$$

The coefficient r may be secured, however, without com- 
puting S as an intermediate value. The above formula for 
r may be reduced to 

$$r^2 = \frac{a\Sigma(Y) + b\Sigma(XY) - Nc_y^2}{\Sigma(Y^2) - Nc_y^2}$$

where $c_y$ is the difference between the mean Y and the 
origin employed in the calculations.* If the origin is zero 
on the original Y scale, $c_y$ will be equal to the arithmetic 
mean of the Y's. 

* The formula 

$$r = \sqrt{1 - \frac{S_y^2}{\sigma_y^2}}$$

may be written 

$$r^2 = 1 - \frac{S_y^2}{\sigma_y^2} = 1 - \frac{\Sigma(d^2)}{\Sigma(y^2)}$$

in which y refers to deviations from the arithmetic mean of the Y's. But 

$$\frac{\Sigma(y^2)}{N} = \frac{\Sigma(Y^2)}{N} - c_y^2$$

where Y represents a deviation from an arbitrary origin (in this case zero on the 
original scale) and $c_y$ represents the difference between this origin and the 
mean of the Y's. 

Therefore 

$$r^2 = 1 - \frac{\Sigma(d^2)}{\Sigma(Y^2) - Nc_y^2}.$$

Substituting in this equation the equivalent of $\Sigma(d^2)$, as given in the 
earlier footnote on the standard error of estimate, 

$$r^2 = 1 - \frac{\Sigma(Y^2) - a\Sigma(Y) - b\Sigma(XY)}{\Sigma(Y^2) - Nc_y^2}.$$

Simplifying, 

$$r^2 = \frac{a\Sigma(Y) + b\Sigma(XY) - Nc_y^2}{\Sigma(Y^2) - Nc_y^2}.$$

In the present case, using the data of Table 86, we have 

$$c_y = \frac{9{,}549}{38} = 251.289.$$

The other values are the same as those employed above in 
computing $S_y$. Substituting in the formula, we have 

$$r^2 = \frac{789{,}228.14}{1{,}210{,}630.86} = .6519.$$
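
The direct computation of r² may be sketched as follows, assuming the origin is zero on the original scale so that $c_y$ is the mean of the Y's. The function name is illustrative. Because the mean is carried here to full machine precision rather than to three decimals, the numerator and denominator differ slightly from the printed figures, though the quotient agrees to four places.

```python
def r_squared_shortcut(sum_y2, sum_y, sum_xy, a, b, n):
    """r^2 = (a*sum(Y) + b*sum(XY) - N*c_y^2) / (sum(Y^2) - N*c_y^2)."""
    c_y = sum_y / n               # mean of Y when the origin is zero
    correction = n * c_y ** 2
    return (a * sum_y + b * sum_xy - correction) / (sum_y2 - correction)


r2 = r_squared_shortcut(3_610_185, 9_549, 451_558, 66.32136, 5.65925, 38)
print(round(r2, 4), round(r2 ** 0.5, 3))
```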


In effect, then, the labor of fitting a straight line by the 
method of least squares gives us practically all the values 
needed in securing S and r, the two other measures necessary 
for a complete description of the relationship between two 
variable quantities. The only additional values required 
are $\Sigma(Y^2)$ and $c_y$. 

... discount rates of commercial banks. Since the paper discounted 
by commercial banks may be rediscounted at the Federal 
Reserve banks by the member banks, some degree of 
relationship between the rates may be expected. Our present 
object is the measurement of that relationship. 

The first step is the tabulation of the original observa- 
tions. Monthly values of each variable^ were secured for 
each of the twelve Federal Reserve cities over a period of 
150 months, from July, 1920, to December, 1932. In the 
process of tabulation the items must be combined so that 
a Federal Reserve bank discount rate is paired with the 
corresponding rate charged by the commercial banks of the 
same city. Fig. 70 illustrates the method of tabulation. 

Tabulation having been completed, a correlation table 
designed to facilitate the later computations may be con- 
structed. Table 88 illustrates a suitable form. 

In Table 88, it will be noted, an arbitrary origin is em- 
ployed for each variable, and the class-interval unit is used 
in the calculations. We here employ the symbols X' and 
Y' to represent deviations from the arbitrary origin (which 
is located at point 1.50, 3.50 on the original scales). 

COMPUTATION OF MEASURES OF RELATIONSHIP 

From this correlation table all the values needed in 
fitting a straight line to the data, and in computing the 
measures S and r, may be derived. The quantities $\Sigma(X')$, 
$\Sigma(X'^2)$, $\Sigma(Y')$ and $\Sigma(Y'^2)$ are computed by methods already 
familiar to the student. The product of the paired values 
$\Sigma(X'Y')$ may be computed directly from the correlation 
table, but it is perhaps simpler for the beginner to re-arrange 
the data in columnar form, as in Table 89 on page 345. 
When the figures are disposed in this way one line is em- 

^ The discount rates of the Federal Reserve banks relate, for the first part 
of the period covered, to trade acceptances; for later years they are “rates 
for member banks on eligible paper.” The commercial bank rates are those 
charged on customers' prime commercial paper. The customary rate over 
a given 30 day period was taken as of the middle of that period. 

Table 89 

Discount Rates of Federal Reserve Banks and Discount Rates of 
Commercial Banks 

(Computation of values required in curve fitting) 


0 

0 

0 

0 

0 

9 

38 

57 

116 

64 

162 

976 

960 

36 

24 

225 

1,332 

2,940 

1,170 

21 

12 

1,440 

3,280 

4,200 

1,260 

180 

525 

4,380 

5,250 

320 

1,440 

144 

924 

480 

1,188 

60 

198 

245 

224 

we have only to substitute the proper values in the equation 

$$r^2 = \frac{a\Sigma(Y') + b\Sigma(X'Y') - Nc_y^2}{\Sigma(Y'^2) - Nc_y^2}.$$

When this is done we have 

$$r^2 = \frac{(-.10277 \times 6{,}904) + (.70509 \times 42{,}932) - (1{,}800 \times 14.71149)}{30{,}878 - (1{,}800 \times 14.71149)}$$

$$= \frac{3{,}080.7178}{4{,}397.3180} = .70059$$

$$r = +.837.$$

All these calculations have been carried through in class- 
interval units, with reference to an origin at point 1.50, 3.50 
on the original scales. The value of r is not affected by this 
fact, but the estimating equation and the standard error 
of estimate should be corrected. 

The value of $S_y$, in class-interval units, is .855. Since 
the class-interval of the Y-variable is .50, we have, in 
original units, 

$$S_y = .50 \times .855 = .4275.$$


The equation may be corrected in a similar fashion. 
The class-interval being .50 both for X and Y, each unit 
on the original scale equals two class-interval units. Thus 
a range of 4 points on the original scale is equivalent to 
a range of 8 points on the class-interval scale. For conven- 
ience we may use Y" and X" to define deviations in original 
units (i.e., deviations from the arbitrary origin), where we 
have used Y' and X' to define corresponding deviations in 
class-interval units. Then, for any stated deviation, 
2Y" = Y'; similarly 2X" = X'. Retaining the values of 
a and b in the equation of average relationship, and sub- 
stituting 2Y" for Y' and 2X" for X', we have 

$$2Y'' = -.10277 + .70509(2X'').$$

Simplifying this, we have 

$$Y'' = -.05138 + .70509X''$$

which is the equation in terms of original units. 

This equation refers to an origin whose coordinates 
are 1.50 and 3.50 on the original scales. That is, 
Y" = Y − 3.50, and X" = X − 1.50, where Y and X 
define deviations, in original units, from the zero points 
on the original scales. Making these substitutions we have 

$$Y - 3.50 = -.05138 + .70509(X - 1.50).$$

Simplifying, and rounding off the constants by dropping 
figures that are not significant, we have 

$$Y = 2.391 + .705X.$$
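
The two corrections just described, for the class-interval and for the arbitrary origin, may be combined in one step. The sketch below is an illustration under the assumption that X' = (X − origin) / interval and likewise for Y'; the function and parameter names are not from the original.

```python
def to_original_units(a_ci, b_ci, x_origin, y_origin, x_interval, y_interval):
    """Convert Y' = a + bX' (class-interval units, arbitrary origin)
    into Y = A + BX on the original scales."""
    slope = b_ci * y_interval / x_interval
    intercept = y_origin + a_ci * y_interval - slope * x_origin
    return intercept, slope


A, B = to_original_units(
    a_ci=-0.10277, b_ci=0.70509,
    x_origin=1.50, y_origin=3.50,
    x_interval=0.50, y_interval=0.50,
)
print(round(A, 3), round(B, 3))
```

When the two class-intervals are equal, as here, the slope is unchanged by the conversion; only the constant term is affected.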

We have now the three values required for determining 
the relationship between Federal Reserve discount rates 
and corresponding commercial bank rates, during the period 
covered. The equation describes the average relationship, 
the standard error of estimate serves as a measure of the 
reliability of estimates based upon this equation, and the 
coefficient of correlation serves as an abstract measure of 
the degree of relationship between the two variables. 

Fig. 71. Scatter Diagram of Federal Reserve and Commercial Bank 
Rates, with Line of Average Relationship and Zones of Estimate. 
(X-axis: Federal Reserve bank rate, per cent.) 

The significance of the standard error is brought out 
graphically in Fig. 71. The line of average relationship 
has been drawn on this scatter diagram, and what may be 
called “zones of estimate” have been marked out about 
this line. Within the zone having a width equal to 2S, 
centering at the fitted straight line, 68 per cent of all the 
points should fall, on the assumption that the distribution 
is normal. Within the zone having a width equal to 6S, 
centering at the fitted straight line, 99.7 per cent of all 
the points should fall, on the same assumption. The smaller 
the value of S the narrower these zones would be, and hence 
the more accurate would be the estimates which are based 
upon the equation of average relationship. 
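
The zones of estimate can be computed for any given Federal Reserve rate. The sketch below assumes the rounded equation and the standard error in original units given in the text; the function name and dictionary keys are illustrative.

```python
def zones_of_estimate(x, a=2.391, b=0.705, s=0.4275):
    """Fitted value with the zones of width 2S and 6S centered upon it."""
    y_c = a + b * x
    return {
        "fitted": y_c,
        "68 per cent zone": (y_c - s, y_c + s),
        "99.7 per cent zone": (y_c - 3 * s, y_c + 3 * s),
    }


zones = zones_of_estimate(4.0)
for name, value in zones.items():
    print(name, value)
```

The percentages attached to the zones hold only on the assumption of a normal distribution about the line, as the text notes.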

The Product-Moment Formula for the Coefficient 
of Correlation 

In the preceding examples the coefficient of correlation 
has been computed from the formula 

$$r^2 = \frac{a\Sigma(Y) + b\Sigma(XY) - Nc_y^2}{\Sigma(Y^2) - Nc_y^2}$$

which is based upon relations involved in fitting a straight 
line by least squares. The usual formula differs somewhat 
from this, and it is advisable that the student be familiar 
with it. 

When a straight line is fitted to data, the origin being 
at the point of averages, the two normal equations 

$$\Sigma(Y) = Na + b\Sigma(X)$$
$$\Sigma(XY) = a\Sigma(X) + b\Sigma(X^2)$$

become 

$$\Sigma(y) = Na + b\Sigma(x)$$
$$\Sigma(xy) = a\Sigma(x) + b\Sigma(x^2)$$

where y and x measure deviations from the point of averages. 
The first of these equations disappears and the second 
reduces to 

$$\Sigma(xy) = b\Sigma(x^2)$$

for 

$$\Sigma(x) = 0 \text{ and } \Sigma(y) = 0.$$

The slope, b, is the only constant required, and this may be 
computed from the relationship 

$$b = \frac{\Sigma(xy)}{\Sigma(x^2)}.$$

Under the same conditions the formula 

$$r^2 = \frac{a\Sigma(Y) + b\Sigma(XY) - Nc_y^2}{\Sigma(Y^2) - Nc_y^2}$$

reduces to 

$$r^2 = \frac{b\Sigma(xy)}{\Sigma(y^2)}$$

for $c_y = 0$ when the deviations are measured from the mean 
of the Y's. Substituting for b its equivalent, as just deter- 
mined, we have 

$$r^2 = \frac{\Sigma(xy) \cdot \Sigma(xy)}{\Sigma(y^2) \cdot \Sigma(x^2)}.$$

But $\Sigma(y^2) = N\sigma_y^2$ and $\Sigma(x^2) = N\sigma_x^2$. 

Therefore 

$$r^2 = \frac{\Sigma(xy) \cdot \Sigma(xy)}{N\sigma_y^2 \cdot N\sigma_x^2}$$

and 

$$r = \frac{\Sigma(xy)}{N\sigma_x\sigma_y}$$

in which x and y refer to deviations from an origin at the 
point of averages. 

This formula may be expressed 

$$r = \frac{p}{\sigma_x\sigma_y}$$

in which 

$$p = \frac{\Sigma(xy)}{N}.$$

The quantity p is the mean product of the paired values of 
x and y. 
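
The product-moment formula may be sketched directly from its definition. The data below are hypothetical and serve only to exercise the function; the names are illustrative.

```python
import math


def product_moment_r(xs, ys):
    """r = p / (sigma_x * sigma_y), with deviations taken from the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    p = sum(u * v for u, v in zip(dx, dy)) / n        # mean product
    sigma_x = math.sqrt(sum(u * u for u in dx) / n)
    sigma_y = math.sqrt(sum(v * v for v in dy) / n)
    return p / (sigma_x * sigma_y)


# Hypothetical paired observations, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(product_moment_r(xs, ys), 4))
```

Unlike the formula based on the standard error, this form carries its own sign, since the mean product p is negative when the relationship is inverse.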

The computation of the coefficient of correlation from 
this formula proceeds along lines somewhat different from 
those outlined above. As we have seen, both the arithmetic 
mean and the standard deviation may be readily computed 
by the selection of an arbitrary origin from which all 
deviations are measured, a later correction being made to 
offset the error involved in using this arbitrary origin. 
Similarly, the mean product p may be computed by a short 
method, requiring the use of assumed means and the applica- 
tion of a correction at the end of the process. 

If x' and y' represent deviations from points arbitrarily 
selected as assumed means, while p' represents the mean 
product of such deviations, then 

$$p' = \frac{\Sigma(x'y')}{N}.$$

The computation of p' is not difficult, for deviations may 
be measured from central points, and may be expressed in 
class-interval units. Having p' we may secure the true 
mean product from the formula 

$$p = p' - c_x c_y$$

in which $c_x$ and $c_y$ represent the differences between the 
true and assumed means of the x's and y's, respectively.^ 

^ The following is a proof of this relationship: 

$x'$ = deviation of any point from assumed mean of x's 
$x$ = deviation of same point from true mean of x's 
$c_x$ = difference between true and assumed means of x's 
$y'$ = deviation of same point from assumed mean of y's 
$y$ = deviation of same point from true mean of y's 
$c_y$ = difference between true and assumed means of y's 

$$x' = x + c_x, \qquad y' = y + c_y$$

$$x'y' = (x + c_x)(y + c_y) = xy + c_x y + c_y x + c_x c_y.$$

For the sum of all such products for N points, we have 

$$\Sigma(x'y') = \Sigma(xy) + c_x\Sigma(y) + c_y\Sigma(x) + Nc_x c_y.$$

But 

$$\Sigma(y) = 0 \text{ and } \Sigma(x) = 0.$$

Therefore 

$$\Sigma(x'y') = \Sigma(xy) + Nc_x c_y$$

$$\frac{\Sigma(x'y')}{N} = \frac{\Sigma(xy)}{N} + c_x c_y$$

$$p' = p + c_x c_y$$

or 

$$p = p' - c_x c_y.$$

The Product-Moment Method, Ungrouped Data 

This method may be illustrated with reference, first, to 
ungrouped data, using the figures for personal incomes (X) 
and passenger car registration (Y), by states. The values 
required for this computation, as given in Table 86, are 

N = 38 
Σ(X) = 1,242 
Σ(Y) = 9,549 
Σ(X²) = 65,236 
Σ(Y²) = 3,610,185 
Σ(XY) = 451,558. 

The mean product may be computed from the formula 

$$p = \frac{\Sigma(x'y')}{N} - c_x c_y.$$

We may select as arbitrary origin the actual origin on the 
two original scales. Hence we have 

$$p = \frac{\Sigma(XY)}{N} - c_x c_y.$$

(When the arbitrary origin is at zero on the original scales, the 
symbol X corresponds to x' and Y corresponds to y', as used in 
the formulas.) 

For the two standard deviations 

$$\sigma_x = \sqrt{\frac{\Sigma(X^2)}{N} - c_x^2}, \qquad \sigma_y = \sqrt{\frac{\Sigma(Y^2)}{N} - c_y^2}.$$

These measures may be computed readily from the values 
secured from Table 86: 

$$c_x = \frac{1{,}242}{38} = 32.684 \qquad c_y = \frac{9{,}549}{38} = 251.289$$

$$c_x^2 = 1{,}068.2439 \qquad c_y^2 = 63{,}146.1615$$

$$p = \frac{451{,}558}{38} - 8{,}213.1297 = 3{,}670.0753$$

$$\sigma_x = \sqrt{\frac{65{,}236}{38} - 1{,}068.2439} = 25.47 \qquad \sigma_y = \sqrt{\frac{3{,}610{,}185}{38} - 63{,}146.1615} = 178.49$$

Solving for the coefficient of correlation, 

$$r = \frac{3{,}670.0753}{25.47 \times 178.49} = +.8073.$$
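
The whole computation from the six summary values may be sketched as below. The function name is an assumption. Because the means are here carried to full precision rather than rounded to three decimals, intermediate figures such as p may differ slightly from those printed, while the standard deviations and r agree to the places shown.

```python
import math


def r_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Product-moment r with the arbitrary origin at zero on both scales."""
    c_x, c_y = sum_x / n, sum_y / n
    p = sum_xy / n - c_x * c_y                     # corrected mean product
    sigma_x = math.sqrt(sum_x2 / n - c_x ** 2)
    sigma_y = math.sqrt(sum_y2 / n - c_y ** 2)
    return p, sigma_x, sigma_y, p / (sigma_x * sigma_y)


p, sigma_x, sigma_y, r = r_from_sums(38, 1_242, 9_549, 65_236, 3_610_185, 451_558)
print(round(sigma_x, 2), round(sigma_y, 2), round(r, 3))
```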

The equation to the straight line which describes the 
average relationship between X and Y may be derived 
from the values required for the preceding calculations. 
When the origin is at the point of averages this equation 
may be written 

$$y = r\frac{\sigma_y}{\sigma_x}x.$$

Substituting the proper values, we have 

$$y = +.8073 \times \frac{178.49}{25.47}\,x = 5.657x.$$

This, with an insignificant difference, is the equation secured 
by the method of least squares. The constant term repre- 
senting the y-intercept disappears, since the origin is at 
the point of averages, through which the least squares line 
must pass.* 

When the product-moment method is employed in com- 
puting the coefficient of correlation and in determining the 
equation of regression, the standard error, $S_y$, may be 
derived by a simple change in the formula first presented 
for r. From the expression 

$$r^2 = 1 - \frac{S_y^2}{\sigma_y^2}$$

we may secure the formula 

$$S_y = \sigma_y\sqrt{1 - r^2}$$

which enables us to compute $S_y$ if we have the values of 
$\sigma_y$ and r. In the present case, 

$$S_y = 178.49\sqrt{1 - (.8073)^2} = 105.3.$$
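
This inversion of the formula for r is easily sketched; the function name is illustrative.

```python
import math


def standard_error_from_r(sigma_y, r):
    """S_y = sigma_y * sqrt(1 - r^2)."""
    return sigma_y * math.sqrt(1 - r ** 2)


print(round(standard_error_from_r(178.49, 0.8073), 1))
```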


The Product-Moment Method, Classified Data 

The product-moment method is also applicable to cases 
in which it is necessary to construct a double frequency or 
correlation table. The procedure is shown in detail in 
Table 90. 

* That the formula $y = r\frac{\sigma_y}{\sigma_x}x$ is equivalent to the formula based upon the 
method of least squares may be readily demonstrated. When the line passes 
through the point of averages, the equation Y = a + bX becomes y = bx. 
But $b = \frac{\Sigma(xy)}{\Sigma(x^2)}$. We may write, accordingly, 

$$y_c = \frac{\Sigma(xy)}{\Sigma(x^2)}x.$$

This is equivalent to 

$$y_c = r\frac{\sigma_y}{\sigma_x}x$$

for the latter may be written 

$$y_c = \frac{\Sigma(xy)}{N\sigma_y\sigma_x} \cdot \frac{\sigma_y}{\sigma_x}x \tag{1}$$

$$y_c = \frac{\Sigma(xy)}{N\sigma_x^2}x \tag{2}$$

$$y_c = \frac{\Sigma(xy)}{N \cdot \frac{\Sigma(x^2)}{N}}x \tag{3}$$

$$y_c = \frac{\Sigma(xy)}{\Sigma(x^2)}x. \tag{4}$$

(The symbol $y_c$ is employed for the computed value of y in these equations, 
to avoid confusion with the actual y's which appear in the right-hand members 
of the equations.) 

This table is identical with that previously presented for 
the same data, except that a different arbitrary origin has 
been selected. 

The value 4.50 is adopted as the assumed mean of the 
X's ($M'_x$), and the value 5.50 as the assumed mean of 
the Y's ($M'_y$). Deviations are measured in class-interval 
units from this origin. In each compartment of the correla- 
tion table there are three figures, involved in the computation 
of $\Sigma(x'y')$. The figure in the center indicates the number 
of items falling in that compartment. Thus there are 
seven pairs having X values between 5.75 and 6.25 (mid- 
point 6.0) and Y values between 7.25 and 7.75 (mid-point 
7.5). For each of these pairs x' (the deviation from the 
assumed mean of the X's) is +3, in class-interval units, 
and y' (the deviation from the assumed mean of the Y's) 
is +4, in class-interval units. For each pair, therefore, 
x'y' = +12. This figure appears at the top of the compart- 
ment. But there are seven pairs in this compartment, so 
the sum of x'y' for this group is 84. This figure appears 
in parentheses at the bottom of the compartment. To 
secure $\Sigma(x'y')$ for the entire table it is necessary to add 
algebraically the values secured in this way for all com- 
partments. The addition is first carried out for the different 
rows, the sub-totals being given in the column at the right 
of the table. It is found that $\Sigma(x'y')$ = +4,492, in class- 
interval units. 

Details of the computation of the coefficient of correlation 
are given in Table 91 on page 358. The values of $c_x$ and $c_y$ 
are obtained by methods already familiar. 

We have, from that table, 

$$p = \frac{\Sigma(xy)}{N} = \frac{\Sigma(x'y')}{N} - c_x c_y = +2.4277.$$

$$r = \frac{\Sigma(xy)}{N\sigma_x\sigma_y} = \frac{p}{\sigma_x\sigma_y} = +.837.$$
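
The procedure for classified data may be sketched as follows. The small correlation table used here is hypothetical (only the compartment with seven pairs at x' = +3, y' = +4 is taken from the text); the function name and the dictionary layout are assumptions for illustration. The last lines repeat the arithmetic of the text from its own summary figures.

```python
import math


def grouped_r(cells):
    """cells maps (x', y') deviations in class-interval units to frequencies."""
    n = sum(cells.values())
    c_x = sum(f * x for (x, _), f in cells.items()) / n
    c_y = sum(f * y for (_, y), f in cells.items()) / n
    p_prime = sum(f * x * y for (x, y), f in cells.items()) / n
    p = p_prime - c_x * c_y                      # corrected mean product
    var_x = sum(f * x * x for (x, _), f in cells.items()) / n - c_x ** 2
    var_y = sum(f * y * y for (_, y), f in cells.items()) / n - c_y ** 2
    return p / math.sqrt(var_x * var_y)


# Hypothetical compartments; (3, 4) with 7 pairs contributes 7 * 12 = 84.
cells = {(-1, -1): 3, (0, 0): 5, (1, 1): 4, (1, 0): 2, (3, 4): 7}
print(round(grouped_r(cells), 3))

# Summary figures quoted in the text for the discount-rate data.
p_text = 4492 / 1800 - (-0.414) * (-0.164)
r_text = p_text / (1.855 * 1.563)
print(round(p_text, 4), round(r_text, 3))
```

Moving the assumed means changes $c_x$, $c_y$ and p', but the corrections cancel, so r is the same whatever arbitrary origin is chosen.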

In computing r, both the numerator ^nri a ■ 

carried therefore, ie 

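Table 91 

(Calculations based on the entries in Table 90) 

For the X's (assumed mean $M'_x$ = 4.50): 

$c_x = \frac{-746}{1{,}800} = -.414$ 
$c_x^2 = (-.414)^2 = .171$ 
$s_x^2 = \frac{6{,}506}{1{,}800} = 3.614$ 
$\sigma_x^2 = s_x^2 - c_x^2 = 3.614 - .171 = 3.443$ 
$\sigma_x = 1.855$ 
$M_x = 4.50 - .5(.414) = 4.293$ 

For the Y's (assumed mean $M'_y$ = 5.50): 

$c_y = \frac{-296}{1{,}800} = -.164$ 
$c_y^2 = (-.164)^2 = .027$ 
$s_y^2 = \frac{4{,}446}{1{,}800} = 2.470$ 
$\sigma_y^2 = s_y^2 - c_y^2 = 2.470 - .027 = 2.443$ 
$\sigma_y = 1.563$ 
$M_y = 5.50 - .5(.164) = 5.418$ 

Mean product and coefficient of correlation: 

$p = \frac{\Sigma(x'y')}{N} - c_x c_y = \frac{4{,}492}{1{,}800} - (-.414)(-.164) = 2.4956 - .0679 = +2.4277$ 

$r = \frac{p}{\sigma_x\sigma_y} = \frac{+2.4277}{(1.855)(1.563)} = \frac{+2.4277}{2.8994} = +.837$ 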

The equation which describes the average relationship 
between x and y may be derived from the formula 

$$y = r\frac{\sigma_y}{\sigma_x}x.$$

Before this is applied, the two standard deviations should 
be expressed in original units. 
This is done by multiplying the present values by the 
class-intervals. 

$\sigma_x$ (in original units) = 1.855 × .50 = .9275 
$\sigma_y$ (in original units) = 1.563 × .50 = .7816. 

Substituting the given values in the formula, we have 

$$y = .837 \times \frac{.7816}{.9275}\,x = .705x.$$
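
The slope of the regression of y on x follows from r and the two standard deviations. The sketch below is illustrative; the function name is an assumption, and the standard deviations are converted to original units inside the example by multiplying by the class-interval of .50.

```python
def regression_slope(r, sigma_y, sigma_x):
    """Slope of the line y = r * (sigma_y / sigma_x) * x."""
    return r * sigma_y / sigma_x


sigma_x = 1.855 * 0.50   # original units
sigma_y = 1.563 * 0.50   # original units
slope = regression_slope(0.837, sigma_y, sigma_x)
print(round(slope, 3))
```

Since the class-interval is the same for both variables in this example, the ratio of the standard deviations, and hence the slope, is the same in either set of units.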


The Lines of Regression 

In the above discussion certain terms ordinarily employed 
in the treatment of correlation have been purposely omitted. 
Several of these should be explained. 

The equation to the line of best fit in the preceding 
illustration was found to be 

$$y = .705x$$

when the origin was taken at the point of averages. In 
this equation y is expressed as a function of x; that is, x is 
taken to be the independent variable and y the dependent 
variable. The equation expresses the average variation in 
y (discount rates of commercial banks) corresponding to a 
change of one unit in x (discount rates of Federal Reserve 
Banks). This line of relationship corresponds precisely to 
a line of trend, which describes the average change in a 
given series accompanying a unit change in time. A line 
which thus describes the average relationship between two 
variables is termed a line of regression. Its equation is 
termed a regression equation, and the quantity $r\frac{\sigma_y}{\sigma_x}$ which 
gives the slope of such a line is called a coefficient of regression. 
The use of these terms dates back to early studies by 
Galton, dealing with the relation between the heights of 
fathers and the heights of sons. Sons, Galton found, deviated 
less on the average from the mean heights of the race than 
their fathers. Whether the fathers were above or below 
the average, the sons tended to go back or regress towards 
the mean. He therefore termed the line which graphically 
described the average relationship between these two vari- 
ables the line of regression. The term is now used generally 
as indicated above, though the original meaning has no 
significance in most of its applications. 

... denominator is not altered. In practice it is advisable always to express the two 
standard deviations in original units at this stage of the calculations. 

In any given case equations to two lines of regression 
may be computed. One is an expression of the average 
relationship between a dependent Y-variable and an inde- 
pendent X-variable; the other describes the relationship 
between a dependent X-variable and an independent 
Y-variable. The significance of the two may be indicated 
graphically. 

Figure 72 is derived directly from the scatter diagram 
presented in Fig. 71. The circle in each column represents 
the mean Y-value of all the items falling in that column. 
Thus in the third column there are 40 cases including all 
those with X-values falling between 2.25 per cent and 
2.75 per cent. The Y-values vary, however, being distributed 
as shown in Table 92. 

Table 92 

Computation of the Arithmetic Mean of an Array 

Class interval     Mid-point (m)     Frequency (f)       fm 
4.75 - 5.24             5.0                 4           20.0 
4.25 - 4.74             4.5                16           72.0 
3.75 - 4.24             4.0                19           76.0 
3.25 - 3.74             3.5                 1            3.5 
                                           40          171.5 

$$M = \frac{171.5}{40} = 4.2875.$$
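
The mean of a single column of the correlation table is an ordinary weighted mean of the class mid-points. A minimal sketch, with an illustrative function name and the figures of Table 92:

```python
def array_mean(midpoints, frequencies):
    """Arithmetic mean of one array: sum(f * m) / sum(f)."""
    total = sum(frequencies)
    return sum(m * f for m, f in zip(midpoints, frequencies)) / total


mean_third_column = array_mean([5.0, 4.5, 4.0, 3.5], [4, 16, 19, 1])
print(mean_third_column)
```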

Similar mean values are obtained for the other columns. 
These are plotted in Fig. 72, together with the line of 
regression of Y on X. 

In Fig. 72 the X-variable (Federal Reserve bank discount 
rates) is independent. As it increases from 4.0 per cent 
to 4.5, 5.0, 5.5 per cent, and so on, the average of com- 
mercial bank rates increases also. An average commercial 
bank rate of 4.29 per cent was associated with an average 
Federal Reserve bank rate of 2.5 per cent; an average 
commercial bank rate of 4.56 per cent was associated with 
an average Federal Reserve bank rate of 3.0 per cent, 
and so on. (The commercial bank rates cited are the means 
of the entries in the columns referred to.) The slope of 
the straight line, which is the line of regression or the line 
of average relationship, measures the average increase in 
commercial bank rates corresponding to a unit increase in 
Federal Reserve bank rates. 

Fig. 72. Showing the Relation between Discount Rates of Commercial 
Banks and Federal Reserve Bank Discount Rates. (The broken line 
connects the means of the columns and the straight line shows the average 
change in commercial bank rates corresponding to a unit change in Federal 
Reserve bank rates; i.e., it represents the regression of Y on X.) 
X-axis: Federal Reserve bank rates, per cent, in classes from 1.25 to 7.25. 
Means of columns: 3.60, 3.90, 4.28, 4.56, 4.86, 5.U, 5.60, 5.96, 6.40, 6.69, 7.25, 7.02. 

It is possible to view the relationship between these two 
variables in another light. These questions arise: Given 
a certain commercial bank discount rate, what is the average 
Federal Reserve bank rate associated with it? And for a 
given change in commercial bank discount rates, what is 
the average change in the corresponding Federal Reserve 
bank rates? The commercial bank rate is now looked upon 
as independent, and the Federal Reserve rate as an associ- 
ated dependent variable. These questions are answered by 
Fig. 73. The points marked by the small circles and con- 
nected by the broken line show the locations of the arithmetic 
means of the items falling in the various rows. Thus the 
items in the bottom row have an average value of 
2.75 per cent. This is the average Federal Reserve bank 
discount rate associated with a commercial bank rate of 
3.5 per cent. The average Federal Reserve bank rate 
associated with a commercial bank rate of 4.0 per cent is 
2.93 per cent, and so on. The straight line fitted to these 
points indicates the relationship between the two, its slope 
measuring the average increase (or decrease) in Federal 
Reserve bank rates associated with a unit change in com- 
mercial bank rates. 

Fig. 73. Showing the Relation between Federal Reserve Bank Discount 
Rates and the Discount Rates of Commercial Banks. (The broken line 
connects the means of the rows and the straight line shows the average 
change in Federal Reserve bank rates corresponding to a unit change in 
commercial bank rates; i.e., it represents the regression of X on Y.) 
X-axis: Federal Reserve bank rates, per cent. 

This is the line of regression of X on Y. The general
formula for the equation to this line is:

$$x = r\,\frac{\sigma_x}{\sigma_y}\,y.$$

Substituting the present values (r = .837), we have

$$x = .9932\,y.$$

The factors in this equation, it will be seen, are the same as
those entering into the formula for the line of regression of
y on x.¹ If r is equal to 1 the two lines coincide, and if,
in addition, the two standard deviations are equal, the line
of regression will bisect the angle formed by the axes.
If the points be plotted on a chart scaled in units of the
standard deviations, we have y = rx; the slope of the line
of regression is then equal to the value of r.

The coefficient of regression is represented by the symbol
b. In a simple correlation problem there are two such
coefficients, representing the slopes of the two lines of
regression. These are

$$b_{yx} = r\,\frac{\sigma_y}{\sigma_x}, \qquad b_{xy} = r\,\frac{\sigma_x}{\sigma_y}.$$

(The subscripts indicate the relation between the two variables,
the first subscript referring to the dependent variable in each
case.)

¹ The formula

$$x = r\,\frac{\sigma_x}{\sigma_y}\,y$$

may be reduced to

$$x = \frac{\Sigma xy}{\Sigma y^2}\,y.$$

This is the equation to a line fitted to the points plotted in Fig. 73 in such a
way that the sum of the squares of the horizontal deviations is a minimum.
The formula

$$y = r\,\frac{\sigma_y}{\sigma_x}\,x$$

is the equation to the line for which the sum of the squares of the vertical
deviations is a minimum. An understanding of this point may make clear ...

The coefficient r appears in both formulas. This being
so, it is clear that r may be computed from the regression
coefficients. For

$$\sqrt{b_{yx} \cdot b_{xy}} = \sqrt{r\,\frac{\sigma_y}{\sigma_x} \cdot r\,\frac{\sigma_x}{\sigma_y}} = \sqrt{r^2} = r.$$

Thus if we know the slopes of the two lines of regression
r may be determined. In the present example

$$r = \sqrt{.705 \times .9932} = .837.$$
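As a check on the arithmetic, the relation between r and the two regression coefficients can be sketched as follows. The helper name `r_from_slopes` is an assumption made for illustration; the sign rule (r takes the sign shared by the two slopes) follows the text's instruction to give r the sign of b.

```python
import math


def r_from_slopes(b_yx, b_xy):
    """Geometric mean of the two regression coefficients, with their common sign."""
    if b_yx * b_xy < 0:
        # Both slopes equal r times a positive ratio, so they cannot differ in sign.
        raise ValueError("the two regression coefficients must have the same sign")
    return math.copysign(math.sqrt(b_yx * b_xy), b_yx)


r = r_from_slopes(0.705, 0.9932)
print(round(r, 3))
```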

USE OF THE EQUATIONS OF REGRESSION

The two equations of regression given above

y = .705x

and

x = .9932y

describe relations between deviations from the respective
arithmetic means. That is, the origin is at the point of
averages, and to use the equations we cannot use the
original values of X and Y but must express them as
deviations from their means. For example, we wish to determine
the normal commercial bank rate associated with a Federal
Reserve bank rate of 6 per cent. The mean value of the
X-variable is 4.293 per cent. A rate of 6 per cent thus
represents a deviation of + 1.707. Substituting this value
in the first of the above equations, we have

y = .705 × (+ 1.707)

= + 1.203.

This is the average y-deviation associated with an x-deviation
of + 1.707. To get the normal commercial bank rate
associated with a Federal Reserve rate of 6 per cent the
quantity + 1.203 per cent must be added to the mean
commercial bank rate, 5.418 per cent. The value we wish
is thus 6.621 per cent.

This calculation has been rather round-about because 
of the form of the equation of relationship. This equation 
can be put in more appropriate form for such computations. 
Let

$\bar{X}$ = arithmetic mean of the X's
$\bar{Y}$ = arithmetic mean of the Y's.

Then

y = bx

may be written

$$Y - \bar{Y} = b(X - \bar{X}).$$

In this last equation X and Y represent the values of the
variables on the original scales, and not as deviations from
their respective means. In terms of the coordinate chart, it
means shifting the origin from the point of averages to a
point corresponding to zero on each of the original scales.

To illustrate the greater utility of the equation in this
form, the equation

y = .705x

may be changed in the manner indicated. It becomes

Y - 5.418 = .705(X - 4.293)

          = .705X - 3.027
        Y = 2.391 + .705X.

This is the equation with the origin so shifted that the
original values may be employed directly. To determine
the commercial bank rate normally associated with a
Federal Reserve rate of 6 per cent we may substitute the latter
value in the equation just secured.

Y = 2.391 + (.705 × 6.0)

  = 6.621.

Precisely the same results are secured as with the equation 
in the other form, but for many purposes it is preferable 
to have an equation in which the actual values may be 
inserted. 
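The two forms of the regression equation can be compared directly in a short Python sketch. The constants are those quoted in the text; the function and variable names are assumptions introduced only for this illustration.

```python
MEAN_X = 4.293   # mean Federal Reserve bank rate, per cent
MEAN_Y = 5.418   # mean commercial bank rate, per cent
B_YX = 0.705     # coefficient of regression of Y on X


def estimate_from_deviation(X):
    """Deviation form: y = b*x, with the origin at the point of averages."""
    x = X - MEAN_X          # deviation of the given rate from its mean
    y = B_YX * x            # average associated y-deviation
    return MEAN_Y + y       # back on the original scale


# Origin shifted to zero on both scales: Y = a + b*X.
INTERCEPT = MEAN_Y - B_YX * MEAN_X


def estimate_direct(X):
    """Original-units form: actual values are inserted directly."""
    return INTERCEPT + B_YX * X


print(round(estimate_from_deviation(6.0), 3), round(estimate_direct(6.0), 3))
```

Both functions return the same estimate; the second merely saves the step of converting to deviations each time.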

The equation

$$x = r\,\frac{\sigma_x}{\sigma_y}\,y$$

may be similarly changed to

$$X - \bar{X} = r\,\frac{\sigma_x}{\sigma_y}\,(Y - \bar{Y}).$$

Summary of Correlation Procedure

In the foregoing pages there have been presented two 
quite different methods of securing the values required in 
measuring the relationship between two variables. The 
steps in the two methods may be briefly summarized. The 
method of least squares is basic in both cases, but that term 
may appropriately be employed to describe the first method 
outlined, for the process of fitting the line is the first and 
fundamental step in that procedure. 

1. The Least Squares Method. 

A. Data to be handled as individual items. 

1. Fit a straight line to the data by the method of least
squares. A simple arrangement of the data in columns
will permit the ready computation of the required
values, $\Sigma(X)$, $\Sigma(Y)$, $\Sigma(X^2)$, $\Sigma(Y^2)$, $\Sigma(XY)$. The
equation thus obtained describes the average relationship
between the two variables.

2. Compute the standard error of estimate, $S_y$, from the
formula

$$S_y^2 = \frac{\Sigma(Y^2) - a\Sigma(Y) - b\Sigma(XY)}{N}.$$

$S_y$ is a measure of the reliability of estimates based
upon the equation of relationship, and is to be interpreted
in the same way as is the standard deviation
about an arithmetic mean.

3. Compute the coefficient of correlation, r, from the
formula

$$r^2 = 1 - \frac{S_y^2}{\sigma_y^2}$$

or from

$$r^2 = \frac{a\Sigma(Y) + b\Sigma(XY) - Nc_y^2}{\Sigma(Y^2) - Nc_y^2}.$$

Give r the sign of the constant b in the equation of regression.
This coefficient is an abstract measure of the degree
of relationship between the two variables, in so far as this
relationship may be described by a straight line.

4. If an equation describing the regression of X on Y
(X being dependent) is desired, the proper values may
be substituted in the two normal equations

$$\Sigma(X) = Na + b\Sigma(Y)$$

$$\Sigma(XY) = a\Sigma(Y) + b\Sigma(Y^2).$$

The equation secured will be of the type

X = a + bY.

The standard error of estimate, $S_x$, may be computed by
making the appropriate changes in the formula as given
for $S_y$. The value of r will be the same as in the
preceding case, in which Y is dependent.

B. Data to be classified. 

1. Select an appropriate class-interval and tabulate the 
items in the form of a correlation table. 

2. Compute the necessary values for fitting a straight line
to the data. In doing so, an arbitrary origin may be
selected for each variable, and all values expressed in
class-interval units. A re-arrangement in columnar form
may facilitate the computation of the quantity

$\Sigma(X'Y')$.

3. Compute the standard error of estimate, employing the
formula given above.

4. Compute the coefficient of correlation from the formula
given above.

5. If the above calculations were carried on in class-interval
units, the equation of average relationship and the
standard error of estimate should now be expressed in terms of
the original units of measurement. If an arbitrary origin
was employed, the equation should be corrected so that
the variables relate to deviations from the true origin.
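The least squares steps for individual items (1. A. above) can be sketched as follows. This is a minimal illustration, not a prescribed routine: the function name and the five paired observations are hypothetical, chosen only so the arithmetic can be followed by hand.

```python
import math


def least_squares_summary(X, Y):
    """Fit Y = a + bX, then compute S_y and r from the raw sums."""
    n = len(X)
    sx, sy = sum(X), sum(Y)
    sxx = sum(x * x for x in X)
    syy = sum(y * y for y in Y)
    sxy = sum(x * y for x, y in zip(X, Y))
    # Normal equations:  sum(Y) = N*a + b*sum(X);  sum(XY) = a*sum(X) + b*sum(X^2)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    # Standard error of estimate from the raw sums (guarded against round-off).
    s_y = math.sqrt(max(0.0, (syy - a * sy - b * sxy) / n))
    c_y = sy / n
    r_squared = (a * sy + b * sxy - n * c_y ** 2) / (syy - n * c_y ** 2)
    r = math.copysign(math.sqrt(max(0.0, r_squared)), b)  # r takes the sign of b
    return a, b, s_y, r


# Hypothetical paired observations, for illustration only.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
a, b, s_y, r = least_squares_summary(X, Y)
print(a, b, s_y, r)
```

For the regression of X on Y the same function may be called with the arguments interchanged.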

2. The Product-Moment Method.

A. Data to be handled as individual items.

1. Arrange the paired observations in parallel columns and
compute the quantities $\Sigma(X)$, $\Sigma(Y)$, $\Sigma(X^2)$, $\Sigma(Y^2)$,
$\Sigma(XY)$.

2. Divide these quantities throughout by N. For the first
two of these quotients we may use the symbols $c_x$ and
$c_y$ (i.e.,

$$\frac{\Sigma(X)}{N} = c_x \quad \text{and} \quad \frac{\Sigma(Y)}{N} = c_y).$$

3. Compute the mean product from the formula

$$p = \frac{\Sigma(XY)}{N} - c_x c_y.$$

4. Compute the two standard deviations from the formulas

$$\sigma_x = \sqrt{\frac{\Sigma(X^2)}{N} - c_x^2}, \qquad \sigma_y = \sqrt{\frac{\Sigma(Y^2)}{N} - c_y^2}.$$

5. Compute the coefficient of correlation from the formula

$$r = \frac{p}{\sigma_x \sigma_y}.$$

6. Determine the equations of regression by substituting
the proper values in the formulas

$$y = r\,\frac{\sigma_y}{\sigma_x}\,x$$

$$x = r\,\frac{\sigma_x}{\sigma_y}\,y.$$

(Note: For each of these equations the origin is at the
point of averages.)

7. If desired, transfer the origin to zero on the two original
scales by substituting the arithmetic means in the
equations

$$Y - \bar{Y} = r\,\frac{\sigma_y}{\sigma_x}\,(X - \bar{X})$$

$$X - \bar{X} = r\,\frac{\sigma_x}{\sigma_y}\,(Y - \bar{Y}).$$

8. Compute the two standard errors of estimate from the
formulas

$$S_y = \sigma_y\sqrt{1 - r^2}$$

$$S_x = \sigma_x\sqrt{1 - r^2}.$$
B. Data to be classified.

1. Construct a correlation table as in 1. B. above.

2. Select an assumed mean for each variable. Measure the
deviations of the various items from the assumed means
in class-interval units.

3. Compute $c_x$ and $c_y$ in class-interval units.

4. Compute $\sigma_x$ and $\sigma_y$ in class-interval units.

5. Compute $\Sigma(x'y')$ in class-interval units for each
compartment of the correlation table. Total these figures to get
$\Sigma(x'y')$ for the whole table.

6. Determine the value of the mean product in class-interval
units from the formula

$$p = \frac{\Sigma(x'y')}{N} - c_x c_y.$$

7. Compute r from the formula

$$r = \frac{p}{\sigma_x \sigma_y}.$$

8. Reduce $\sigma_x$ and $\sigma_y$ to original units.

9. Determine the equations of regression by substituting
the proper values in the formulas

$$y = r\,\frac{\sigma_y}{\sigma_x}\,x$$

and

$$x = r\,\frac{\sigma_x}{\sigma_y}\,y.$$

10. If desired, transfer the origin to zero on the two original
scales from the formulas

$$Y - \bar{Y} = r\,\frac{\sigma_y}{\sigma_x}\,(X - \bar{X})$$

$$X - \bar{X} = r\,\frac{\sigma_x}{\sigma_y}\,(Y - \bar{Y}).$$

11. Compute the two standard errors of estimate from the
formulas

$$S_y = \sigma_y\sqrt{1 - r^2}$$

$$S_x = \sigma_x\sqrt{1 - r^2}.$$
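The product-moment steps for individual items (2. A. above) can be sketched in the same spirit. The function name, the dictionary keys and the five paired observations are hypothetical and serve only to make the sequence of steps concrete.

```python
import math


def product_moment(X, Y):
    """Follow steps 2-8 of the product-moment method for ungrouped data."""
    n = len(X)
    c_x = sum(X) / n                                     # step 2
    c_y = sum(Y) / n
    p = sum(x * y for x, y in zip(X, Y)) / n - c_x * c_y  # step 3: mean product
    sigma_x = math.sqrt(sum(x * x for x in X) / n - c_x ** 2)  # step 4
    sigma_y = math.sqrt(sum(y * y for y in Y) / n - c_y ** 2)
    r = p / (sigma_x * sigma_y)                          # step 5
    b_yx = r * sigma_y / sigma_x                         # step 6: slope of y on x
    b_xy = r * sigma_x / sigma_y                         #         slope of x on y
    spread = math.sqrt(max(0.0, 1 - r * r))
    return {
        "r": r, "b_yx": b_yx, "b_xy": b_xy,
        "S_y": sigma_y * spread,                         # step 8
        "S_x": sigma_x * spread,
        "mean_x": c_x, "mean_y": c_y,                    # needed for step 7
    }


# Hypothetical paired observations, for illustration only.
result = product_moment([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(result)
```

Applied to the same observations, this procedure and the least squares procedure give identical values of r and of the standard error of estimate, as the text leads one to expect.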

It is advisable, in all cases, to construct scatter diagrams 
and to plot the lines of regression thereon. It is generally 
possible to derive from such diagrams a truer idea of the 
relations involved, and of the adequacy of the methods 
employed, than may be obtained from a study of the figures 
alone. 

LIMITATIONS: 

A question naturally arises as to the degree of generality 
attaching to the measures of relationship described in the 
preceding pages. Are they limited to certain types of dis- 
tributions, or may they be employed as absolutely general 
and universally valid measures? 

As we have seen, the standard deviation has a precise and
definite meaning with respect to distributions following the
normal law. Having values of the mean and of the standard
deviation, we know, in such cases, the exact percentage
of observations which will fall within any stated limits.
If the distribution departs from the normal type the standard
deviation is still a useful measure, but it cannot be
interpreted in the same exact sense. Bearing this in mind, the
formula

$$r = \sqrt{1 - \frac{S_y^2}{\sigma_y^2}}$$

may be considered.

When the distribution of the original values of the
dependent variable about their mean is normal and the
distribution about the least squares line is normal, both
$S_y$ and $\sigma_y$ have specific and exact meanings, and it is
perfectly legitimate to compute such a measure as r, based
upon the relation of one to the other. Departures from
normality in either case reduce the significance of this
comparison. But we have seen that the standard deviation
remains a useful measure even though the departure from 
the normal type be fairly pronounced, though in the latter 
case it lacks the precise significance attaching to it in a 
normal distribution. In the same way the standard error of 
estimate and the coefficient of correlation may be computed 
and utilized, even when all the requirements of normality are 
not met. Care must be taken in their interpretation in 
such cases, however. It must be clearly recognized that 
these measures have their full significance only in cases 
where the original distribution of the dependent variable 
and the distribution about the least squares line are both 
normal, or approximately so. 

A simple example may make clear the effect upon the 
value of the coefficient of correlation of an extreme departure 
from a normal distribution. In the following table are 
listed certain selected figures taken from the 1919 Census 
of Manufactures, for the State of New York.

Table 93

Wage-Earners in Factories and Value of Products, 1919, in Eleven
Cities in the State of New York

City               Number of wage-earners    Total value of products
                   (in thousands)            (in millions of dollars)

Batavia
Beacon
Corning
Geneva
Glens Falls
Ithaca
Middletown
Peekskill
Rensselaer
Tonawanda
New York City

When the first ten of these cities, in the order listed, are
treated as a group, the following values are secured:

$\sigma_y$ = 1.8682

$S_y$ = 1.8669

(No general significance is to be attached to the above 
coefficient of correlation, for the cities were selected for the 
purpose of illustrating a particular point.) 

The ten points and the line of regression are plotted in 
Fig. 74. 

When New York City is included in the group, the values 
secured for the sample of eleven cities are 

$\sigma_y$ = 1509.3
$S_y$ = 7.53
r = + .999988.

The eleven points and the line of regression are plotted in 
Fig. 75. 

The reason for the markedly different results is obvious.
When the one very large city is included with the ten small
cities the standard deviations of both variables are greatly
increased. That of the Y-variable (value of products) is
increased from 1.8682 to 1509.3. But $S_y$, the measure of
the scatter about the fitted line, undergoes no such
pronounced change in value. For the ten cities it is 1.8669,
for the eleven cities 7.53. This is due to the fact that the
one exceptional case is given such a great weight, in fitting
by the method of least squares, that the fitted line must
pass through or very near the point representing this
observation. Accordingly, $S_y$ is always affected less than
$\sigma_y$ by a single very exceptional case. Since the value of r
depends upon the relationship between $S_y$ and $\sigma_y$, a single
case of this sort may greatly alter the
value of the measure of correlation. The introduction of
the one exceptional case in the above example changes a
correlation coefficient of virtually zero to one of unity.
The result, of course, is meaningless.

Fig. 74. Showing the Relation between Number of Wage-Earners in
Factories and Value of Products in Ten Selected Cities in the State of
New York. (Horizontal axis: thousands of wage-earners; vertical axis:
value of products, millions of dollars.)

While this example represents an extreme instance, the
same distortion will be felt, to a greater or less degree,
whenever there is a departure from a normal distribution.
In practice the various measures of relationship cannot
be restricted to perfectly normal distributions, but they
must be interpreted with care when there is reason to believe
that such disturbing influences are present.

Fig. 75. Showing the Relation between Number of Wage-Earners in
Factories and Value of Products in Eleven Selected Cities in the State of
New York.
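The effect of a single exceptional case can be reproduced with a small Python sketch. The figures below are hypothetical and are not the census values of Table 93; they are chosen so that the ten small cases show no correlation at all, and the helper name `correlation_summary` is likewise an assumption.

```python
import math


def correlation_summary(X, Y):
    """Return r, sigma_y and the standard error of estimate S_y."""
    n = len(X)
    mean_x, mean_y = sum(X) / n, sum(Y) / n
    sxx = sum((x - mean_x) ** 2 for x in X)
    syy = sum((y - mean_y) ** 2 for y in Y)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
    r = sxy / math.sqrt(sxx * syy)
    sigma_y = math.sqrt(syy / n)
    s_y = sigma_y * math.sqrt(max(0.0, 1 - r * r))
    return r, sigma_y, s_y


# Ten hypothetical small cases with zero correlation between X and Y.
small_x = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
small_y = [10, 12, 14, 12, 10, 12, 10, 8, 10, 12]
r10, sigma10, s10 = correlation_summary(small_x, small_y)

# The same ten cases plus one hypothetical very large case.
r11, sigma11, s11 = correlation_summary(small_x + [650], small_y + [5000])

print(r10, sigma10, s10)
print(r11, sigma11, s11)
```

The standard deviation rises several hundredfold while the standard error of estimate rises far less, so that r is forced toward unity by the one extreme observation.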

The Coefficient of Rank Correlation

The coefficient of rank correlation is a measure of
relationship not subject to the distortion introduced by material
departures from normality, and one which is particularly
useful in providing an objective test of the existence of
correlation. Its application calls merely for the orderly
ranking of observations. Thus we may rank 47 states of
the union¹ according to the number of individual income
tax returns in 1934, and according to the number of
passenger automobiles registered in that year. The results are
shown in Table 94.


Table 94

Illustrating the Computation of the Coefficient of Rank Correlation

(1)              (2)                  (3)                  (4)          (5)
State            Rank on basis of     Rank on basis of     Difference   d²
                 number of            number of            (2) - (3)
                 individual income    passenger            d
                 tax returns          automobiles
                 in 1934              registered in 1934

Nevada                  1                    1                 0           0
Wyoming                 2                    3               - 1           1
New Mexico              3                    4               - 1           1
S. Dakota               4                   15               - 11        121
Idaho                   5                    9               - 4          16
N. Dakota               6                   12               - 6          36
Vermont                 7                    5               + 2           4
Delaware                8                    2               + 6          36
Arizona                 9                    6               + 3           9
Utah                   10                    7               + 3           9
Mississippi            11                   13               - 2           4
Arkansas               12                   16               - 4          16
S. Carolina            13                   18               - 5          25
New Hampshire          14                    8               + 6          36
Montana                15                   10               + 5          25
Maine                  16                   14               + 2           4
Alabama                17                   19               - 2           4
Nebraska               18                   30               - 12        144
Oregon                 19                   21               - 2           4
W. Virginia            20                   17               + 3           9
Colorado               21                   22               - 1           1
Rhode Island           22                   11               + 11        121
N. Carolina            23                   31               - 8          64
Florida                24                   23               + 1           1
Kentucky               25                   25                 0           0
Kansas                 26                   33               - 7          49
Louisiana              27                   20               + 7          49
Tennessee              28                   26               + 2           4
Georgia                29                   29                 0           0
Oklahoma               30                   32               - 2           4
Virginia               31                   28               + 3           9
Iowa                   32                   35               - 3           9
Minnesota              33                   36               - 3           9
Indiana                34                   38               - 4          16
Maryland               35                   24               + 11        121
Connecticut            36                   27               + 9          81
Wisconsin              37                   34               + 3           9
Missouri               38                   37               + 1           1
Texas                  39                   42               - 3           9
Michigan               40                   41               - 1           1
Ohio                   41                   44               - 3           9
New Jersey             42                   40               + 2           4
Massachusetts          43                   39               + 4          16
Illinois               44                   43               + 1           1
California             45                   46               - 1           1
Pennsylvania           46                   45               + 1           1
New York               47                   47                 0           0

Total                                                                  1,094

¹ Washington is excluded, because the published income tax returns for that
state include those of Alaska.

Following are the records for the nine states not listed in Table 86.

                 No. of taxable personal    No. of passenger automobiles
                 incomes (in thousands)     registered (in thousands)
                 1934                       1934

California              316                      1,769
Illinois                310                      1,282
Massachusetts           243                        687
Michigan                ...

The degree of correlation is indicated by the degree of
concordance between the two rankings. A precise measure
of correlation is provided by the coefficient

$$\rho_r = 1 - \frac{6\,\Sigma d^2}{n^3 - n}$$

where d is a difference between the rankings of a given
state in columns (2) and (3), and n is the number of states
included.² (The Greek letter rho ($\rho$) with subscript r is used
as the symbol of this coefficient.)

The method of computation is shown in Table 94. From
the measurements there given we have

$$\rho_r = 1 - \frac{6 \times 1{,}094}{(47)^3 - 47} = 1 - \frac{6{,}564}{103{,}776} = .94.$$
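The computation of Table 94 can be sketched in Python. The ranks are those of columns (2) and (3) of the table, in the order listed; the variable names are assumptions introduced for the illustration.

```python
# Column (2): ranks on income tax returns, Nevada (1) through New York (47).
income_rank = list(range(1, 48))

# Column (3): ranks on passenger automobiles registered, same order of states.
auto_rank = [
    1, 3, 4, 15, 9, 12, 5, 2, 6, 7, 13, 16,
    18, 8, 10, 14, 19, 30, 21, 17, 22, 11, 31, 23,
    25, 33, 20, 26, 29, 32, 28, 35, 36, 38, 24, 27,
    34, 37, 42, 41, 44, 40, 39, 43, 46, 45, 47,
]

n = len(income_rank)
# Column (5): squared rank differences, totalled.
sum_d2 = sum((a - b) ** 2 for a, b in zip(income_rank, auto_rank))
# Coefficient of rank correlation: 1 - 6 * sum(d^2) / (n^3 - n).
rho = 1 - 6 * sum_d2 / (n ** 3 - n)

print(sum_d2, round(rho, 2))
```

Note that the formula in this simple form presupposes that there are no tied ranks, as is the case here.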

² This formula may be derived from the familiar product-moment formula
for the coefficient of correlation, simplified because of the fact that the sum
of the squares of the deviations of the first n natural numbers from their mean
is equal to

$$\frac{n^3 - n}{12}.$$

If we let d equal the difference between the rank of one variable and the
corresponding rank of the other, we have, for any given pair of observations,

$$d = X - Y = x - y$$

(since the means of the two series of ranks are identical), and

$$\Sigma d^2 = \Sigma(x - y)^2 = \Sigma x^2 + \Sigma y^2 - 2\Sigma xy$$

$$2\Sigma xy = \Sigma x^2 + \Sigma y^2 - \Sigma d^2.$$

But

$$\Sigma x^2 = \frac{n^3 - n}{12} \quad \text{and} \quad \Sigma y^2 = \frac{n^3 - n}{12}.$$

Therefore

$$2\Sigma xy = \frac{n^3 - n}{6} - \Sigma d^2.$$

But

$$r = \frac{\Sigma xy}{\sqrt{\Sigma x^2\,\Sigma y^2}}$$

(the product-moment formula for r), so that

$$\rho_r = \frac{\dfrac{n^3 - n}{12} - \dfrac{\Sigma d^2}{2}}{\dfrac{n^3 - n}{12}} = 1 - \frac{6\,\Sigma d^2}{n^3 - n}.$$

The coefficient of rank correlation is appropriate where
it is possible to rank individuals, or other entities, on the
basis of abilities or qualities not open to exact measurement.
It is also well adapted for use where the distributions of
the observations depart widely from the normal type, and
where the usefulness of customary measurements would
be seriously impaired. This point takes on particular
importance in connection with tests of significance, involving
generalizations from sample results. Such tests are discussed
in later chapters.

CHAPTER XI


THE MEASUREMENT OF RELATIONSHIP
BETWEEN TIME SERIES

The methods of measuring correlation described in the 
preceding chapter were devised originally for the analysis 
of non-historical data, that is, for the treatment of frequency 
series rather than time series. The measurement of corre- 
lation between series in time presents certain distinctive 
problems which require separate treatment. 

We have seen that such series are affected by various
forces, which have been classified as the secular trend,
cyclical and seasonal fluctuations and accidental variations,
and methods have been described by means of which the 
effects of these various forces may be isolated. This breaking 
up of a series into its component parts for separate study 
is essential in attempting to correlate series in time, for 
spurious and quite misleading results will be secured if 
this is not done. The problem of correlation is that of 
securing a precise measure of the degree of relationship 
between variable quantities. But each series in time repre- 
sents the combination of a number of variables and, so 
far as possible, each should be treated separately in corre- 
lating such series. 

The relationship between two time series as, for example, 
interest rates and bond prices, may be studied with respect 
to any or all of the following components: 

a. Secular trend. 

b. Cyclical fluctuations. 

c. Seasonal fluctuations. 

d. Changes from one time unit to the next (e.g., week to week,
month to month, or year to year).

Such relationships may be studied, first, through the
comparison of graphs, and much may be learned by this
simple process. The similarity or dissimilarity of secular
trends, and the general relation between cyclical movements 
may be determined by a study of such graphs. For more 
accurate comparison the coefficient of correlation may be 
used, but when it is so employed it is particularly impor- 
tant that the precise nature of its employment and the 
exact significance of the results be understood. 

For the comparison of secular trends the coefficient of 
correlation would never be employed. The mere fact that 
two series have the same secular trend is no indication 
of a relationship of interdependence; a coefficient of correla- 
tion based upon the trend values would be meaningless. 
Moreover, much simpler methods are available for comparing 
trends. 

For the same reason a coefficient of correlation should 
not be based upon the original absolute values of two 
series in time, except in the rather rare case in which neither 
series is marked by a definite secular trend. The computa- 
tion of r, when dealing with ordinary statistical data, 
involves measuring the deviations of all the items from 
their respective arithmetic means, and securing the sum 
of the products of the paired deviations. When deviations
of like sign are paired throughout r will have a positive
value; when deviations of unlike signs are paired throughout
r will have a high negative value. The presence of
pronounced rising or declining secular trends makes it
impossible to secure significant values for r by the employment of
this method. For example, the relation between automobile
production and the price of bacon between the years 1900
and 1920 might be measured. The secular trend is markedly
rising in each case. When the deviations of the annual
figures are measured from the arithmetic means of the
two series, the paired items for the earlier years will be
negative, for the later years positive. A fairly high positive
value for r would be secured, were the computation carried
through on this basis. This value would be quite misleading,
for no real relationship can be expected in this case. The
coefficient of correlation in such a case would measure,
primarily, the relation between the two secular trends.
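The point may be illustrated with a short Python sketch. The two series below are invented for the purpose (they are not actual automobile or bacon figures): each is a rising straight line plus an unrelated oscillation. The helper names and the use of a straight-line trend are also assumptions of the sketch.

```python
import math


def pearson_r(u, v):
    """Product-moment coefficient of correlation between two series."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    suv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    suu = sum((a - mean_u) ** 2 for a in u)
    svv = sum((b - mean_v) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def deviations_from_line(series):
    """Deviations of a series from its own least-squares straight-line trend."""
    n = len(series)
    t_mean = (n - 1) / 2
    s_mean = sum(series) / n
    slope = (sum((t - t_mean) * (s - s_mean) for t, s in enumerate(series))
             / sum((t - t_mean) ** 2 for t in range(n)))
    return [s - (s_mean + slope * (t - t_mean)) for t, s in enumerate(series)]


years = range(21)  # 21 hypothetical annual observations
# Hypothetical series: rising trends with unrelated short-term movements.
autos = [10 + 5 * t + 5 * (1 if t % 2 == 0 else -1) for t in years]
bacon = [3 + 0.4 * t + 0.5 * (1 if t % 4 < 2 else -1) for t in years]

r_raw = pearson_r(autos, bacon)
r_detrended = pearson_r(deviations_from_line(autos), deviations_from_line(bacon))
print(round(r_raw, 3), round(r_detrended, 3))
```

The raw figures yield a high positive coefficient solely because both trends rise; once deviations are taken from the trend lines the coefficient falls nearly to zero.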

This coefficient might conceivably be employed to deter- 
mine the similarity between seasonal fluctuations in two 
series, but its utility for this purpose may be questioned. 
Here again other and simpler methods are available. 

In practice, therefore, the device of correlation should 
be employed neither to measure the relation between secular 
trends nor between seasonal movements. Its use is confined 
to comparisons of two or more series with respect to cyclical 
fluctuations and with respect to the short time changes 
from month to month or year to year. And, if valid measures 
of correlation are to be secured in making such comparisons, 
the effects of forces which distort these comparisons should 
be eliminated, in so far as this is possible. The actual work 
of correlation must be preceded by a sifting process designed 
to remove such irrelevant material. Unless the data are 
thus “distilled” the interpretation of the resulting coeffi- 
cients will be difficult. 

The Measurement of Correlation between Cyclical Fluctuations

In an earlier chapter we have dealt with methods by 
which the effects of certain of the factors affecting time 
series might be measured and eliminated. The spurious 
correlation due to secular trend may be avoided by measur- 
ing the deviations of the observations not from the respec- 
tive arithmetic averages but from the lines of secular trend 
of the two series. These variations, the deviations from 
trend, are the significant values if our interest centers in the 
cycles. If annual values are employed the problem of elim- 
inating seasonal fluctuations is not faced. 

To illustrate this method of measuring the relationship
between series in time we may undertake to determine
whether there is any connection between cyclical fluctuations
in cotton production and in cotton prices. Figures for
crop years are to be employed, for the period 1901-02 to
1935-36.

Cotton prices require some correction before correlation
is attempted. The raw figures with which the investigation
starts (average spot prices in New York for middling upland
cotton) are deflated by Bradstreet's index of wholesale prices,
with the average for the crop year 1913-14 equal to 100. The
original figures for the two series to be correlated, together
with the corrected price figures, are given in Table 95.

Table 95

Cotton Production and Cotton Prices, 1901-1936

Column (2): Cotton production in United States, excluding linters
            (in thousands of bales)
Column (3): Cotton prices. Average of spot prices in N. Y. for middling
            upland cotton, Sept. to May (in cents per pound)
Column (4): Bradstreet's price index, average, Sept. to May
            (1913-14 = 100)
Column (5): Cotton prices, deflated (in cents per pound)

  (1)          (2)        (3)       (4)       (5)
Crop year

1901-02       9,510      8.64      86.2     10.02
1902-03      10,631      9.50      90.0     10.56
1903-04       9,851     13.20      88.6     14.90
1904-05      13,438      8.69      89.3      9.73
1905-06      10,575     11.40      92.3     12.35
1906-07      13,274     10.97      98.8     11.10
1907-08      11,107     11.41      93.2     12.24
1908-09      13,242      9.81      91.3     10.74
1909-10      10,005     14.62     100.6     14.53
1910-11      11,609     14.80      97.8     15.13
1911-12      15,693     10.34     100.0     10.34
1912-13      13,703     12.35     104.8     11.78
1913-14      14,156     13.40     100.0     13.40
1914-15      16,135      8.63     105.2      8.20
1915-16      11,192     12.04     121.2      9.93
1916-17      11,450     18.29     151.0     12.11
1917-18      11,302     29.96     197.9     15.14
1918-19      12,041     30.06     203.1     14.80
1919-20      11,421     38.63     226.3     17.07
1920-21      13,440     16.90     152.9     11.05
1921-22       7,954     18.67     127.2     14.68
1922-23       9,762     26.26     149.7     17.54
1923-24      10,140     31.79     145.3     21.88
1924-25      13,628     24.34     150.4     16.18
1925-26      16,104     20.60     153.8     13.39
1926-27      17,977     14.26     141.0     10.11
1927-28      12,956     20.19     149.2     13.53
1928-29      14,478     20.02     145.1     13.80
1929-30      14,825     17.00     132.1     12.87
1930-31      13,932     10.47     107.6      9.74
1931-32      17,096      6.42      86.1      7.46
1932-33      13,002      6.75      76.2      8.85
1933-34      13,047     10.95     100.6     10.89
1934-35       9,636     12.42     106.7     11.64
1935-36      10,443     11.59     112.6     10.29
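The correction applied in column (5) of Table 95 is a simple deflation: each price is divided by the index for the same crop year and multiplied by 100. A brief sketch follows, using three rows of the table; the function name `deflate` is an assumption.

```python
def deflate(price, index, base=100.0):
    """Express a price in terms of the base-period price level."""
    return price / index * base


# (crop year, spot price in cents per pound, Bradstreet's index) from Table 95.
rows = [
    ("1901-02", 8.64, 86.2),
    ("1913-14", 13.40, 100.0),
    ("1919-20", 38.63, 226.3),
]

deflated = {year: round(deflate(price, index), 2) for year, price, index in rows}
print(deflated)
```

The high money price of 1919-20 is seen to owe much of its height to the general level of prices in that year.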

These data are plotted in Figs. 76 and 77. Lines of trend 
fitted to the two series are shown on the charts.¹



Fig. 77. — Prices of Middling Upland Cotton in New York, Crop Years 
1901-1902 to 1935-1936, with Line of Trend. (Figures relate to average 
annual prices, during crop years, deflated by Bradstreet’s index of whole- 
sale prices) 

The deviation of each annual item from the secular trend 
of the given series is now to be measured, and the coefficient 
of correlation between these deviations is to be calculated. 
The computations appear in Table 96. 

This value of — , 648 for the coefiicient indicates a fair 
degree of negative correlation between deviations of cotton 
production in the United States from the line of trend and 

‘ Tlie equation to the line of trend of cotton prodpetion is 

F = 13,009.14 + 87.96Z - 4.640X» - .1491X^ with origin at 1918-19. 

The trend equation for deflated cotton prices is 

F =13.96 + .152X - .01425X» - .00083X^ with origin at 1918-19. 




Table 96

Computation of Coefficient of Correlation, Cotton Production and
Cotton Prices

  (1)        (2)           (3)           (4)         (5)         (6)
 Crop     Deviation of  Deviation of
 year     cotton pro-   deflated cot-
          duction from  ton prices
          trend (in     from trend
          1,000's of    (in cents
          bales)        per lb.)
             x             y             x²          y²          xy

1901-02   - 1,395       - 1.32       1,946,025     1.7424   +  1,841.40
1902-03   -   393       -  .72         154,449      .5184   +    282.96
1903-04   - 1,298       + 3.63       1,684,804    13.1769   -  4,711.74
1904-05   + 2,161       - 1.59       4,669,921     2.5281   -  3,435.99
1905-06   -   834       +  .95         695,556      .9025   -    792.30
1906-07   + 1,731       -  .42       2,996,361      .1764   -    727.02
1907-08   -   572       +  .57         327,184      .3249   -    326.04
1908-09   + 1,428       - 1.11       2,039,184     1.2321   -  1,585.08
1909-10   - 1,945       + 2.49       3,783,025     6.2001   -  4,843.05
1910-11   -   476       + 2.87         226,576     8.2369   -  1,366.12
1911-12   + 3,476       - 2.14      12,082,576     4.5796   -  7,438.64
1912-13   + 1,357       -  .93       1,841,449      .8649   -  1,262.01
1913-14   + 1,648       +  .45       2,715,904      .2025   +    741.60
1914-15   + 3,542       - 4.98      12,545,764    24.8004   - 17,639.16
1915-16   - 1,516       - 3.47       2,298,256    12.0409   +  5,260.52
1916-17   - 1,366       - 1.50       1,865,956     2.2500   +  2,049.00
1917-18   - 1,615       + 1.35       2,608,225     1.8225   -  2,180.25
1918-19   -   968       +  .84         937,024      .7056   -    813.12
1919-20   - 1,671       + 2.97       2,792,241     8.8209   -  4,962.87
1920-21   +   275       - 3.15          75,625     9.9225   -    866.25
1921-22   - 5,273       +  .41      27,804,529      .1681   -  2,161.93
1922-23   - 3,515       + 3.25      12,355,225    10.5625   - 11,423.75
1923-24   - 3,174       + 7.62      10,074,276    58.0644   - 24,185.88
1924-25   +   290       + 2.00          84,100     4.0000   +    580.00
1925-26   + 2,758       -  .65       7,606,564      .4225   -  1,792.70
1926-27   + 4,637       - 3.73      21,501,769    13.9129   - 17,296.01
1927-28   -   360       -  .04         129,600      .0016   +     14.40
1928-29   + 1,202       +  .58       1,444,804      .3364   +    697.16
1929-30   + 1,608       +  .07       2,585,664      .0049   +    112.56
1930-31   +   793       - 2.56         628,849     6.5536   -  2,030.08
1931-32   + 4,055       - 4.24      16,443,025    17.9776   - 17,193.20
1932-33   +    80       - 2.17           6,400     4.7089   -    173.60
,., 174. .90 
6 , 858.60 
Table 96 (Continued)

Computation of Coefficient of Correlation, Cotton Production and
Cotton Prices

$$\sigma_x = \sqrt{\frac{\Sigma x^2}{N}} = \sqrt{\frac{171{,}865{,}603}{35}} = 2{,}216.0$$

$$\sigma_y = \sqrt{\frac{\Sigma y^2}{N}} = \sqrt{\frac{227.2511}{35}} = 2.548$$

$$r = \frac{\Sigma xy}{N\sigma_x\sigma_y} = \frac{-128{,}167.61}{35 \times 2{,}216.0 \times 2.548} = -.648.$$



the corresponding deviations of cotton prices in New York, 
during the period covered. 

From the values already computed we may derive an 
equation for estimating the variation in cotton price associ- 
ated with a given variation in production. This regression 
equation, as we have seen, is of the type 


$$y = r\frac{\sigma_y}{\sigma_x}x.$$


In the present case y and x refer to deviations from the 
parabolic lines of trend. Substituting the given values, we 
have 


$$y = -.648\left(\frac{2.548}{2{,}216}\right)x$$

$$y = -.00074x.$$


This equation means that, on the average, a unit devia- 
tion of cotton production (x) above the line of trend was 
accompanied by a deviation of .00074 units in cotton prices 
(y) below the line of trend. The unit employed in the
production figures was 1,000 bales, in the deflated price 
figures, one cent. In the interpretation of the equation 
it may be simpler to use an x-unit of one million bales, 
making the equation of regression 

y = — .74x. 

Thus a cotton crop one million bales above trend was 
accompanied by prices about three quarters of a cent per
pound below trend (with reference always to deflated 
prices). This was the average relationship during the 
period 1901-1936. It did not hold in all cases, as is shown 
by the fact that r has a value of but — .648. If this, or 
a similar law, held perfectly, r would have a value of - 1. 

The value of $S_y$, which measures the scatter about the
line of regression, may be computed from the formula

$$S_y = \sigma_y\sqrt{1 - r^2}.$$

In the present case, $S_y$ has a value of 1.94 cents. The

significance of this measure has been explained in an earlier 
section. 
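
The arithmetic of this example may be retraced from the three
totals of Table 96 alone. The following is a minimal sketch, not
the author's own worksheet: the function and variable names are
assumptions introduced for illustration, and the deviations are
taken to have zero means, as they do when measured from lines of
trend fitted by least squares.

```python
import math

# Totals from Table 96 (x in thousands of bales, y in cents per pound).
N = 35
SUM_X2 = 171_865_603.0
SUM_Y2 = 227.2511
SUM_XY = -128_167.61


def correlation_from_sums(n, sum_x2, sum_y2, sum_xy):
    """Return (sigma_x, sigma_y, r) for deviations whose means are zero."""
    sigma_x = math.sqrt(sum_x2 / n)
    sigma_y = math.sqrt(sum_y2 / n)
    r = sum_xy / (n * sigma_x * sigma_y)
    return sigma_x, sigma_y, r


sigma_x, sigma_y, r = correlation_from_sums(N, SUM_X2, SUM_Y2, SUM_XY)

# Regression coefficient: cents of price deviation per 1,000 bales.
slope = r * sigma_y / sigma_x

# Scatter about the line of regression.
s_y = sigma_y * math.sqrt(1 - r ** 2)

print(f"sigma_x = {sigma_x:.1f}  sigma_y = {sigma_y:.3f}  r = {r:.3f}")
print(f"slope per million bales = {slope * 1000:.3f}  S_y = {s_y:.2f}")
```

Carried to more decimals the slope is close to -.000745 per
thousand bales, which the text records as -.00074.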

(It should be emphasized that the use of the above 
equation for estimating future prices is dependent upon 
the validity of projecting the two lines of secular trend.) 

In the preceding analysis deviations were measured in 
absolute units, and the results could be interpreted only 
in terms of absolute units, bales of cotton and cents per 
pound. For certain purposes it might have been more 
convenient to correlate percentage deviations from the two 
lines of trend, in which case the standard deviations and 
the equation of regression would have been expressed in 
these terms. The procedure, in this respect, will depend 
in part upon the use to which the results are to be put.
The nature of the data will also affect a decision on this 
point. The use of percentage rather than absolute deviations 
would be desirable in handling series in which the range
of absolute deviations had changed materially during the 
period covered. 
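
The distinction between absolute and percentage deviations may be
seen in a small sketch. The figures below are hypothetical and are
chosen only so that the level of the series doubles midway; the
function name is likewise an assumption.

```python
def percentage_deviations(actual, trend):
    """Express each deviation from trend as a percentage of the trend value."""
    if len(actual) != len(trend):
        raise ValueError("actual and trend must have the same length")
    return [100.0 * (a - t) / t for a, t in zip(actual, trend)]


# Hypothetical series: the trend level doubles, and so do the absolute swings.
actual = [110.0, 90.0, 220.0, 180.0]
trend = [100.0, 100.0, 200.0, 200.0]

absolute = [a - t for a, t in zip(actual, trend)]
relative = percentage_deviations(actual, trend)

print("absolute deviations:  ", absolute)
print("percentage deviations:", relative)
```

In this illustration the absolute deviations double in range while
the percentage deviations keep the same range throughout, which is
the circumstance in which the percentage form is to be preferred.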

It is obvious that in the above problem there is an 
arbitrary element which was not present in the correlation 
problems previously studied. The deviations are measured 
from lines of trend, not from the arithmetic means, and
these lines of trend are arbitrarily selected. The use of 
different lines of trend might give quite different results. 
In the above example the lines of trend were both power
curves of the third degree. We might, perhaps with equal 
reason, assume that the underlying trends are best defined 
by other functions. Coefficients of regression and correlation 
would have different values if this were done. The presence 
of this arbitrary element in the correlation of deviations 
from lines of secular trend detracts somewhat from the 
confidence that may be placed in the results. The critical 
problem here lies not in the mechanical process of correla- 
tion, but in the choice of an appropriate line of trend for 
each series. If, by the tests of inspection and of corre- 
spondence with such external evidence as may be available, 
it appears that the curve selected accurately represents the 
trend in each of the series correlated, the coefficient may 
be accepted as significant. But, in the interpretation and 
use of the results, the presence of this element of personal 
judgment in the preliminary calculations must not be 
forgotten. This applies with particular force if the study 
aims to establish a functional relationship between cyclical 
fluctuations in the two series, and if an estimating (or 
regression) equation is to be based upon the results. 
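
How far the choice of trend may govern the result can be shown
with a deliberately simple sketch. The two series below are
hypothetical, and the two "trends" compared are a horizontal line
at the mean and a straight line fitted by least squares; these
stand in for any pair of competing trend functions and are not the
third degree curves used in the cotton example.

```python
import math


def correlation(xs, ys):
    """Coefficient of correlation, with deviations taken from the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    s_xx = sum((x - mx) ** 2 for x in xs)
    s_yy = sum((y - my) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


def linear_trend_deviations(series):
    """Deviations of a series from its least squares straight-line trend."""
    n = len(series)
    mid_t = (n - 1) / 2
    mean_s = sum(series) / n
    slope = (sum((t - mid_t) * (s - mean_s) for t, s in enumerate(series))
             / sum((t - mid_t) ** 2 for t in range(n)))
    return [s - (mean_s + slope * (t - mid_t)) for t, s in enumerate(series)]


# Hypothetical data: both series grow, but their short swings are opposed.
swing = [1.0 if t % 2 == 0 else -1.0 for t in range(10)]
series_a = [10.0 + 2.0 * t + d for t, d in enumerate(swing)]
series_b = [5.0 + 1.0 * t - 0.5 * d for t, d in enumerate(swing)]

r_about_means = correlation(series_a, series_b)
r_about_trends = correlation(linear_trend_deviations(series_a),
                             linear_trend_deviations(series_b))

print(f"deviations from horizontal lines: r = {r_about_means:+.3f}")
print(f"deviations from sloping lines:    r = {r_about_trends:+.3f}")
```

With these invented figures the coefficient is strongly positive
under one trend and strongly negative under the other, an extreme
instance of the arbitrary element discussed above.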

The Coefficient of Correlation and the
Measurement of Time Sequence

In the correlation of cotton production and cotton prices 
the object was to measure as accurately as possible the 
effect of variations in cotton production upon cotton prices. 
An equation was secured which described this relation
when deviations were measured from the particular lines
of trend employed. Cotton prices were considered to be a 
function of cotton production, and the object of the study 
was to measure this functional relationship. We seek, 
in such cases, to determine the degree to which cycles 
in one series depend upon or reflect cycles in a related 
series, assuming some functional relationship between them. 
This is essentially the problem described in introducing the 
subject of correlation, and generally constitutes the major
problem in studying the relation between series of any type. 

But a second and somewhat different problem may be 
faced in certain studies of time series. Assuming that 
two such series are marked by definite cycles, it is of 
interest to determine whether the cycles coincide in time 
or whether cycles in one series consistently precede or lag 
behind cycles in the other. The coefficient of correlation 
has been found very useful in determining the degree of 
“lead” or “lag” in such cases. This problem is that of 
determining merely temporal relationship, as opposed to 
the functional relationship that is ordinarily to be measured.

THE RELATION BETWEEN STOCK PRICE CYCLES AND CYCLES
OF BUSINESS ACTIVITY

To illustrate the solution of a problem of this latter type, 
we may undertake to determine the relation, in time,
between cyclical movements in industrial stock prices and 
in general business activity, as measured by the composite 
index compiled by the American Telephone and Telegraph 
Company. The monthly values of this index for the period 
1899-1937 have been presented in an earlier section. Figures 
relating to stock prices from January, 1903, to June, 1914 
are given in Table 97. 


Table 97 

Cycles in Industrial Stock Prices, 1903-1914 ^
(Figures relate to deviations from trend in units of the standard deviation)


Month 1903 1904 

January — .2—2.0 
February — ,1 —2.1 
March — .3—21 
April. - .5-2.0 
May - .5-2.1 
June - .9-21 
July - 1.4 - i.g 

August - 1.7 - 1.6 
September — 1.9 — 1,3 
October -- 2.3 — .9 
November —2.4 — 3 4-11 

tJeoember -2.1 - .1 4.1;$ 


1906 1906 1907 

- .1 +2.3 + 1.6 


2 +2.2 
.6 +1.9 
.7 +1.7 
.2 +1.4 
.3 +1.5 
.7 + 1.3 
.8 +1.7 
.7 + 1.7 
.8 +1.7 


+ 1.7 - l.’o 4" 
4* 1.7 - 1.6 + 


1908 1009 1910 

- 1.3 + ,5 + 1,0 

.4 -1.6 + .3 + 5 

.0 - i.i + .3 + ;« 

.6 - .8 + .5 + ..5 

.5 + .8 + .4 

.5 + .9 + 

.2 + 1.1 - 

.3 +1.4 - 

.1 +1.4 - 

.2 +1.4 
.5+1.4' + 

.5 +1'.3' 


.4 - 
. .2 - 
,3 - 
.3 + 
.6 + 
1.3 + 


.1 


1011 
- .1 
+ .1 
.1 


- .1 


1912 

- .4, 

- .4 

- .1 

, 1 ^ 3 

0 + .2 
+ .2 


.4 + .1 + .2 
.3 - ,3 + ,3 
.3 - .7 + .4 
0 - .7 + .3 

.1 - .5 + .2 
■2 - .4 0 


1913 

— .2 
_ r 

~ .6 

- ,7 

- 1.1 

- .9 

- .7 

- ' .5 

- .8 


1914 

- .7 

— .6 
- .0 
— ,8 
— . . 8 


^ These figures, the results of analyses by W. M. Persons, are from the
Review of Economic Statistics, published by the Harvard Committee on
Economic Research. They are based upon the average price of 12 industrial stocks.
The data of the two series are plotted in Fig. 78.^ From
a comparison of the two curves in this chart it is clear 
that there is some relation between the movements in 


Fig. 78. Comparison of Cyclical Fluctuations in Industrial Stock Prices
and in General Business Activity, 1903-1914

the two series, but such a comparison affords no basis 
for a definite conclusion. Our object is to determine whether 
the cycles in the two series are exactly synchronous and, 
if they are not, to measure the average time interval by 
which cycles in one series precede the cycles in another. 
The significance of such studies in the analysis of the business 
cycle is obvious. 

For the study of pre-war relations data for the period 
from January, 1903, to June, 1914, may be employed. 
A coefficient of correlation is first computed for concur- 
rent items. A value of + .55 is secured. Next, the data 
are correlated with industrial stock prices preceding general 

^ The American Telephone and Telegraph index here plotted is not identical 
with that given in Chapter IX. The latter is a revised series, differing in some 
respects from the original index for the pre-war period that has been used in 
the present calculations. 
business by one month. That is, the January, 1903, figure
for stock prices is multiplied by the February, 1903, index 
of general business; the February stock price is multiplied 
by the March business index, etc. This process is carried 
through for the entire period from January, 1903, to June, 
1914. Only 137 monthly values are used in this computa- 
tion, as compared with 138 in the preceding case, for the 
January, 1903, business index and the June, 1914, stock 
price figure do not enter into the calculations. Accordingly, 
the values $c_x$ and $c_y$ (the two corrections to be applied
because the origin does not coincide with the two averages)
and the two standard deviations will be slightly different.
These corrections may be readily made. The coefficient
of correlation secured from these computations has a value
of +.65. The same operation is repeated with other
pairings of the two variables. The results are summarized
below.


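Table 98

Coefficients of Correlation between Industrial Stock Prices and an
Index of General Business Activity
(Based upon data for the period 1903-1914)

                                                      Coefficient of
                                                       Correlation
Stock prices concurrent with business index               + .55
Stock prices preceding business index by  1 month         + .65
Stock prices preceding business index by  2 months        + .70
Stock prices preceding business index by  3 months        + .73
Stock prices preceding business index by  4 months        + .76
Stock prices preceding business index by  5 months        + .76
Stock prices preceding business index by  6 months        + .76
Stock prices preceding business index by  7 months        + .74
Stock prices preceding business index by  8 months        + .71
Stock prices preceding business index by  9 months        + .67
Stock prices preceding business index by 10 months        + .61
Stock prices preceding business index by 11 months        + .54

These figures are plotted in Fig. 79.

The coefficients increase to a maximum value of +.76,
which is secured with stock prices preceding general business
by 4, 5, and 6 months. The stability of the coefficients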
with the period of “lead” varying from 3 to 7 months 
indicates that there was no one specific interval, within 
the limits thus indicated, between the cyclical movements 
of these two series. From the results here given it would 





Fig. 79. — Coefficients of Correlation between Index of Industrial Stock 
Prices and Index of Business Activity, 1903-1914, Showing the Results 
Secured with Different Pairings. (In all pairings except that of concurrent 
items the business activity index follows the stock price index) 

appear that five months was the average interval by which 
stock prices preceded the general business index, but this 
was not sharply marked off as a constant relationship. 
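
The successive pairings described above may be sketched as
follows. The two series are hypothetical sine curves, the second
constructed to follow the first by exactly three months, and the
function names are assumptions introduced for illustration; the
original computations were of course made by hand on the actual
indexes.

```python
import math


def correlation(xs, ys):
    """Coefficient of correlation, with deviations taken from the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    s_xx = sum((x - mx) ** 2 for x in xs)
    s_yy = sum((y - my) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


def lead_correlations(leader, follower, max_lead):
    """Correlate leader[t] with follower[t + lead] for lead = 0..max_lead.

    Each additional month of lead drops one pair from the computation,
    just as 138 concurrent pairs become 137 with a lead of one month.
    """
    results = {}
    for lead in range(max_lead + 1):
        xs = leader[:len(leader) - lead]
        ys = follower[lead:]
        results[lead] = correlation(xs, ys)
    return results


# Hypothetical cycles of 24 months; "business" repeats "stock" 3 months later.
months = range(60)
stock = [math.sin(2 * math.pi * t / 24) for t in months]
business = [math.sin(2 * math.pi * (t - 3) / 24) for t in months]

by_lead = lead_correlations(stock, business, max_lead=6)
best_lead = max(by_lead, key=by_lead.get)

for lead, r in by_lead.items():
    print(f"lead of {lead} months: r = {r:+.3f}")
print("highest coefficient at a lead of", best_lead, "months")
```

With actual data the maximum is seldom so sharply marked; as in
Table 98, several neighboring pairings may give nearly equal
coefficients.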

With this record of pre-war relations we may contrast the
experience of recent years. The Index of Industrial Activity
of the American Telephone and Telegraph Company, given
in Chapter IX, defines the state of business. Of stock
price index numbers, the measurements currently published
in the Review of Economic Statistics ^ are in a form best
adapted to our present needs, although a change in coverage
during the period detracts somewhat from their utility for
comparative purposes. Monthly values of this index for
the period 1919-1937 are recorded in Table 99. The two
series are plotted in Fig. 80.

^ This is not a homogeneous series for the entire period covered. For the
years 1919-1924 the index is based on the average price of 20 industrial stocks
(the Dow-Jones index), expressed as deviations from trend in units of the
standard deviation. For the period 1925-1937 the official all-inclusive index
of the New York Stock Exchange (index No. 2) has been used. This index,

(Footnote 1 continued)

Table 99 


Cycles in Stock Prices, 1919-1937 ^


Month 

1919 

1920 

1921 1922 ■ 1923 

1924 

1925 

1926 

1927 

1928 

Jan. 

+ .38 

+ 1.44 

- .69 - 

.68 - .12 

- .47 

+ .15 

+ 1.11 

+ 1.04 

+ 2.61 

Feb. 

+ .38 

+ .78 

- .70 - 

.63 + .05 

- .43 

+ .17 

+ .90 

+ 1.31 

+ 2.31 

March 

+ .70 

+ 1.06 

- .72 ~ 

.38 + .15 

- ,58 

- ,21 

+ .33 

+ 1 . 25 , 

+ 2.96 

April 

+ .90 

+ 1.10 

- .68 - 

.18 - .02 

- .81 

- .12 

+ .52 

+ 1.32 

+ 3.37 

May 

+ 1.40 

+ .50 

.07 

.11 - .31 

- .88 

+ . 19 

+ .55 

+ 1.56 

+ 3.38 

June 

■ 4 - 1.76 

+ ..46 

- 1.15 - 

.14 - .46 

- .82 

+ .27 

+ .81 

+ 1.39 

+ 2.82 

July 

+ 2.01 

+ .38 

- 1.20 - 

.08 - .70 

- .58 

+ .41 

+ 1.00 

+ 1.94 

+ 2.89 

Aug. 

+ 1.50 

+ .04 

- 1.32 + .05 - .65 

- .39 

+ .43 

+ 1.11 

+ 2.07 

+ 3.37 

Sepi . 

+ 1.78 

+ .11 

- 1.16 + 

.10 - .70 

- .,44 

+ .53 

+ 1,13 

+ 2 . 42 ' 

+ 3.63 

Od . 

+ 2.14 

.04 

- 1.12 + 

.10 - .84 

- , 60 

+ 1.01 

+ .89 

+ 2,06 

+ 3.69 

Nov. 

+ 1.90 

- .44 

^ . 89 '- 

.16 .72 

- .25 

+ .00 

+ 1.07 

+ 2.65 

+ 4.39 

Dec . 

+ 1 . 54 . 

- .85 

- .70 - 

.10 - .60 

- . 02 

+ 1.05 

+ 1.05 

+ 2.67 

+ 4.41 

Month 

1929 

1930 

1931 

1932 1933 

1934 

1935 

1936 

1937 

Jan . 

+ 4.21 

+ 1.11 

- 1.35 

- 4.02 - 

4.31 - 

- 2.83 

- 3.30 ■ 

- 1.61 

■',.64 

Fel). 

+ 4.11 

. + 1..28 

- .83 

- 3.90 - 

4.65 

- 2.90 

- 3.37 

- 1 . 51 ' 

.—",',.61 

March 

+ 3.70 

+ 1.83 

— 1.22 

- 4.20 

4.62 -2.89 

-■ 3.51 

- 1.51 

.66 

April 

+ 3.84 

+ 1.59 

- 1.73 

- 4 ,tM - 

3.91 

- 2.93 

- 3.23 

- 1.92 , 

- 1.11 

.May 

+ 3.17 

+ 1 , 41 

- 2.35 

- 6.05 - 

3.33 

~ 3.19 

- 3.14 

- 1.71 

- 1.17 

June 

+ 3.88 

+ .13 

- 1.85 ' 

- 5.09 - 

2.90 

- 3.13 

- 2.97 

- 1.61 

- 1.44 

July 

+ 4.25 

+ .25 

- 2.15 

- 4.60 - 

3.26 

- 3.51 

- 2.71 

- 1 . 31 . 

- 1.03 

Aug. 

+ 4.89 

+ .24 

- 2.18 ' 

- 3.86 - 

2.89 

- 3.37 

- 2.62 

- 1.27 

- 1..31 

8ept. 

+ 4.10 

- .51 

-■ 3.42 

-. 3.97 - 

3.31 

- 3.40 

- 2.55 

- 1.23 

- 2.04 

Od. 

+ 1.67 

- 1.04 

- 3. ,23 

- 4.30 - 

3.57 

-3.45 

- 2.29 

- ,.90 

- 2.48 

Nov. 

+■ .68 

- 1,21 

3.64 

- 4.42 

3.32 

- 3.21 

- 2.10 

- .78 

-. 2.85 

Dec. 

+ .74 

™ 1.65 

- 3.99 

- 4.37 - 

3.27 

- 3.21 

- 1.93 

- .81 

- 3.03 


Results obtained from a study of the temporal relations 
between these two series, for the period 1919-1937, are 
given in Table 100. 

(Footnote 1 continued)

originally constructed with the figure for Jan. 1, 1925 as 100, has here been
expressed in terms of deviations from 100, in units of a standard deviation
assumed to be equal to 15 on the original scale. In effect, a horizontal trend
at the level of Jan. 1, 1925, has been assumed for the Stock Exchange index.
This index has also been shifted slightly in time. The index figure relating
to the first day of a given month, in the Stock Exchange tabulations, has here
been recorded as for the month preceding. Thus a February 1st index is en-
tered for January, a March 1st index for February, etc.

^ From the Review of Economic Statistics. The figures in the table define
deviations from trend, in units of the standard deviation, with the assumptions 
stated in the preceding footnote. The coefficients in Table 100 are based upon 
data through July, 1937, only. 



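Table 100

Coefficients of Correlation between Stock Prices and an Index of
Business Activity
(Based upon data for the period 1919-1937)

                                                      Coefficient of
                                                       Correlation
Stock prices concurrent with business index               + .85
Stock prices preceding business index by  1 month         + .86
Stock prices preceding business index by  2 months        + .85
Stock prices preceding business index by  3 months        + .83
Stock prices preceding business index by  4 months        + .82
Stock prices preceding business index by  5 months        + .80
Stock prices preceding business index by  6 months        + .78
Stock prices preceding business index by  7 months        + .76
Stock prices preceding business index by  8 months        + .74
Stock prices preceding business index by  9 months        + .71
Stock prices preceding business index by 10 months        + .68
Stock prices preceding business index by 11 months        + .65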


These measurements are shown graphically in Fig. 81.
In using these coefficients we should note that the stock
price records for part of the recent period are different
in important respects from those employed for the pre-war
period. In place of the 12 industrial stocks entering into
the earlier comparisons the index for the recent period
included 20 stocks and, later, a comprehensive list composed
of all varieties of stocks. The market behavior of the
broader list may have departed somewhat from the pattern
set by the limited number of industrial stocks. The differ-
ence between the results for the two periods is to be inter-
preted with this fact in mind.

In post-war years the highest degree of correlation pre-
vailed with the business index following the stock price
index by one month. The traditional "lead" of stock
prices, on the basis of which the movements of these prices
have been used as forecasters of business changes, was
reduced in this period. The actual statistical record
here obtained may have been affected somewhat by the index
used, but the change in the relations between the two
series appears to have been a real one.

This method of measuring temporal relations between 
economic series is highly useful, but one important caution 
should be noted. The method indicates the average degree
of lead or lag of one series, with reference to another. 
Frequently the sequences of change in economic series are 
not the same in all phases of business cycles. Thus, observa- 
tions relating to ten business cycles occurring between 1890 





Fig. 81. — Coefficients of Correlation between Index of Industrial Stock 
Prices and Index of Business Activity, 1919-1937, Showing the Results 
Secured with Different Pairings. (In all pairings except that of concurrent 
items the business activity index follows the stock price index) 

and 1925 indicate that pig iron prices preceded the general 
index of wholesale prices by 3.4 months, on the average, 
in business recessions, but followed the general index by 
5.1 months, on the average, in periods of business revival.^ 
This highly important difference would be ironed out in 
the measurement of average temporal relations by the 

^ Cf. The Behavior of Prices, New York, National Bureau of Economic
Research, 1927, 84-87.
correlation method. The use of an average should be
supplemented by a study of the items entering into the 
average. Study of the relations among individual observa- 
tions at different cyclical phases is essential when correlation 
technique is employed to define time sequences among the 
movements of economic series.

The Use of the Moving Average in Correlating
Cycles in Time Series

The preceding discussion has dealt only with cycles as 
measured from mathematically fitted lines of trend. But 
trend may be measured, as we have seen, by lines based 
upon moving averages, and the cyclical deviations from 
such lines may be correlated in precisely the same way as 
deviations from other lines of trend. The arithmetic 
mean of the deviations from such moving averages will 
not necessarily be zero, as in the case of deviations measured 
from lines fitted by the method of least squares, and a corre- 
sponding correction must be made in correlating such figures. 
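
A minimal sketch of this procedure follows. The two short series
are hypothetical, the period of three items is chosen only for
brevity, and the function names are assumptions. The correlation
is computed with zero as origin and then corrected for the means
of the deviations, which here are not zero.

```python
import math


def deviations_from_moving_average(series, period):
    """Deviations from a centered moving average of odd period.

    The first and last period // 2 items have no average and are dropped.
    """
    if period < 1 or period % 2 == 0:
        raise ValueError("period must be a positive odd number")
    half = period // 2
    return [series[i] - sum(series[i - half:i + half + 1]) / period
            for i in range(half, len(series) - half)]


def corrected_correlation(xs, ys):
    """r from sums about zero, corrected for the means c_x and c_y."""
    n = len(xs)
    c_x, c_y = sum(xs) / n, sum(ys) / n
    sigma_x = math.sqrt(sum(x * x for x in xs) / n - c_x ** 2)
    sigma_y = math.sqrt(sum(y * y for y in ys) / n - c_y ** 2)
    p = sum(x * y for x, y in zip(xs, ys)) / n - c_x * c_y
    return p / (sigma_x * sigma_y)


# Hypothetical annual figures for two related series.
series_a = [10, 14, 11, 17, 13, 20, 16, 24, 18]
series_b = [30, 27, 31, 26, 30, 24, 29, 22, 28]

dev_a = deviations_from_moving_average(series_a, 3)
dev_b = deviations_from_moving_average(series_b, 3)

mean_dev_a = sum(dev_a) / len(dev_a)
r = corrected_correlation(dev_a, dev_b)

print(f"mean deviation of first series = {mean_dev_a:+.3f}")
print(f"corrected coefficient r = {r:+.3f}")
```

A different period for the moving average would yield different
deviations and, in general, a different coefficient.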

Moving averages are subject to the same criticism as 
are mathematical lines of trend. There can be no certainty 
that deviations from lines of trend based upon moving 
averages represent the effects of cyclical causes solely. The 
result in a given case depends upon the period of the moving 
average employed, and there is no perfect criterion by 
which to determine the best measure of trend. Significant 
and useful coefficients may be computed when deviations 
are measured from moving averages, but the presence of 
an arbitrary element in the work must be recognized and 
the results applied with corresponding reservations. 

The Correlation of Short Term Fluctuations

In describing the variable factors that constitute compo- 
nent elements of the values of a series in time, it was pointed 
out that the coefficient of correlation would not generally 
be employed in comparing either the secular trends or the
seasonal fluctuations of two series. It may be used to
advantage in measuring either functional or temporal rela- 
tions between cyclical fluctuations, provided that the effects 
of the other variables have been, so far as possible, elimi- 
nated. The coefficient of correlation and the measures 
which are employed in conjunction with it have a further 
use in dealing with time series. They may be used to meas- 
ure the relation between short term changes in two series, 
changes from year to year, month to month, or even from 
week to week or day to day, if desired. This problem is 
distinct from that studied in the preceding section and in 
the interpretation of the results the two should not be
confused. 

There are several ways in which the problem of comparing 
short term fluctuations may be attacked. The absolute 
differences between successive items in two series may be
correlated, or these differences may be expressed as per-
centages or ratios. Table 101 illustrates the procedure
employed in measuring the correlation between the absolute
fluctuations from year to year (first differences) of cotton
production and cotton prices. The original values from 
which the items in columns (2) and (3) are derived are 
given in Table 95. 

The process of computing r is identical with that em- 
ployed in preceding examples, when deviations were meas- 
ured from an arbitrary origin. The arbitrary origin in 
this case is zero, but corrections must be made in the 
various values since the algebraic sum of the given figures 
is not zero in either case. Computations based on the 
figures in Table 101 follow: 


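$$c_x = \frac{\Sigma X}{N} = \frac{+.933}{34} = +.02744$$

$$c_x^2 = .000753$$

$$\sigma_x = \sqrt{\frac{\Sigma X^2}{N} - c_x^2} = \sqrt{\frac{229.624987}{34} - .000753} = 2.599$$
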
Table 101

Computation of Coefficient of Correlation between Cotton Production
and Cotton Prices, 1902-1936

(Based upon first differences)

  (1)         (2)            (3)            (4)          (5)          (6)
 Crop     Difference be-  Difference be-
 year     tween produc-   tween price in
          tion in given   given year and
          year and pro-   price in pre-
          duction in pre- ceding year (in
          ceding year (in cents per pound,
          millions of     deflated)
          bales)
              X              Y              X²           Y²           XY

1902-03   + 1.121        +  .54         1.256641      .2916    +    .60534
1903-04   -  .780        + 4.34          .608400    18.8356    -   3.38520
1904-05   + 3.587        - 5.17        12.866569    26.7289    -  18.54479
1905-06   - 2.863        + 2.62         8.196769     6.8644    -   7.50106
1906-07   + 2.699        - 1.25         7.284601     1.5625    -   3.37375
1907-08   - 2.167        + 1.14         4.695889     1.2996    -   2.47038
1908-09   + 2.135        - 1.50         4.558225     2.2500    -   3.20250
1909-10   - 3.237        + 3.79        10.478169    14.3641    -  12.26823
1910-11   + 1.604        +  .60         2.572816      .3600    +    .96240
1911-12   + 4.084        - 4.79        16.679056    22.9441    -  19.56236
1912-13   - 1.990        + 1.44         3.960100     2.0736    -   2.86560
1913-14   +  .453        + 1.62          .205209     2.6244    +    .73386
1914-15   + 1.979        - 5.20         3.916441    27.0400    -  10.29080
1915-16   - 4.943        + 1.73        24.433249     2.9929    -   8.55139
1916-17   +  .258        + 2.18          .066564     4.7524    +    .56244
1917-18   -  .148        + 3.03          .021904     9.1809    -    .44844
1918-19   +  .739        -  .34          .546121      .1156    -    .25126
1919-20   -  .620        + 2.27          .384400     5.1529    -   1.40740
1920-21   + 2.019        - 6.02         4.076361    36.2404    -  12.15438
1921-22   - 5.486        + 3.63        30.096196    13.1769    -  19.91418
1922-23   + 1.808        + 2.86         3.268864     8.1796    +   5.17088
1923-24   +  .378        + 4.34          .142884    18.8356    +   1.64052
1924-25   + 3.488        - 5.70        12.166144    32.4900    -  19.88160
1925-26   + 2.476        - 2.79         6.130576     7.7841    -   6.90804
1926-27   + 1.873        - 3.28         3.508129    10.7584    -   6.14344
1927-28   - 5.021        + 3.42        25.210441    11.6964    -  17.17182
1928-29   + 1.522        +  .27         2.316484      .0729    +    .41094
1929-30   +  .347        -  .93          .120409      .8649    -    .32271
1930-31   -  .893        - 3.13          .797449     9.7969    +   2.79509
1931-32   + 3.164        - 2.28        10.010896     5.1984    -   7.21392
1932-33   - 4.094        + 1.39        16.760836     1.9321    -   5.69066
1933-34   +  .045        + 2.04          .002025     4.1616    +    .09180
1934-35   - 3.411        +  .75        11.634921      .5625    -   2.55825
1935-36   +  .807        - 1.35          .651249     1.8225    -   1.08945

Total     +  .933        +  .27       229.624987   313.0067    - 180.19834
$$c_y = \frac{\Sigma Y}{N} = \frac{+.27}{34} = +.00794$$

$$c_y^2 = .000063$$

$$\sigma_y = \sqrt{\frac{\Sigma Y^2}{N} - c_y^2} = \sqrt{\frac{313.0067}{34} - .000063} = 3.034$$

$$p = \frac{\Sigma(XY)}{N} - c_x c_y = \frac{-180.19834}{34} - (.02744 \times .00794) = -5.300168$$

$$r = \frac{p}{\sigma_x\sigma_y} = \frac{-5.300168}{2.599 \times 3.034} = -.672.$$

The equation of regression and the value of $S_y$, computed
from the usual formulas, are

$$y = -.78x$$

$$S_y = 2.25 \text{ cents}.$$

A comparison of the different results secured in the
preceding examples relating to cotton throws some inter-
esting light upon the general problem of correlation. In
fact, in the two examples, we have measured the correlation
between measurements that are not strictly comparable:
deviations from third degree parabolas, in the first case,
and year-to-year fluctuations in the production and price
of cotton, in the second. Yet, if we were seeking to estimate
the price of cotton which would accompany a given crop,
an estimate might be based upon either of the studies,
the results of which are given below.

                                                        r       $S_y$
Correlation of cycles in cotton production and
  prices (deviations measured from third degree
  parabolas)                                          -.648   1.94 cents
Correlation of year-to-year fluctuations, same data   -.672   2.25 cents

The value of r in the second example is slightly greater
than the value secured in the first case, though the standard
error is also larger. The reason for this apparent contradic-
tion has been suggested above; the standard deviation of
the year-to-year fluctuations in cotton prices is greater
than the standard deviation about the trend of cotton prices.

It appears that errors of estimate are less when based 
upon the results secured when deviations from third degree 
curves are correlated than when based upon the study of 
year-to-year movements. But there is a concealed assump- 
tion in the first case, the assumption that the lines of trend 
of both prices and production may be projected beyond the 
period studied. There is an immeasurable margin of error 
in this assumption, and the standard error of estimate, 
accordingly, does not give a true measure of the probabilities 
involved. No such assumption is involved in the measure 
based upon year-to-year fluctuations. 
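
The computation based upon first differences may be retraced from
the totals of Table 101. The following is a minimal sketch under
the assumption that those totals are taken as given; the function
and variable names are introduced for illustration only. Because
the origin is zero rather than the mean, the corrections $c_x$ and
$c_y$ enter at each step.

```python
import math


def first_differences(series):
    """Change from each item to the next: the quantities of columns (2), (3)."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


def correlation_from_origin_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Return (sigma_x, sigma_y, r) from sums measured about an arbitrary origin."""
    c_x, c_y = sum_x / n, sum_y / n
    sigma_x = math.sqrt(sum_x2 / n - c_x ** 2)
    sigma_y = math.sqrt(sum_y2 / n - c_y ** 2)
    p = sum_xy / n - c_x * c_y
    return sigma_x, sigma_y, p / (sigma_x * sigma_y)


# Totals of Table 101 (X in millions of bales, Y in cents per pound).
sigma_x, sigma_y, r = correlation_from_origin_sums(
    n=34, sum_x=0.933, sum_y=0.27,
    sum_x2=229.624987, sum_y2=313.0067, sum_xy=-180.19834,
)
slope = r * sigma_y / sigma_x          # cents per million bales
s_y = sigma_y * math.sqrt(1 - r ** 2)  # standard error of estimate

print(f"sigma_x = {sigma_x:.3f}  sigma_y = {sigma_y:.3f}  r = {r:.3f}")
print(f"slope = {slope:.2f}  S_y = {s_y:.2f}")
```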
CHAPTER XII


THE MEASUREMENT OF RELATIONSHIP: 

NON-LINEAR CORRELATION 

In the preceding chapters the discussion has been confined 
to cases in which the relationship between two variables 
may be described by a straight line. The coefficient of 
correlation, r, is a measure of the degree to which two 
variables approach a linear relationship and it is signifi- 
cant only when a straight line gives a good fit to the points 
representing the paired values of X and Y. 

In fitting curves to time series, as explained in an earlier
section, it is found that in many cases the trend is non- 
linear, and that a curve of higher degree is needed. The 
same thing is true in the field of our present discussion. 
It is possible to have a high degree of correlation between 
two variables when a straight line does not describe the 
relationship. In such a case there would be considerable 
scatter about the straight line of best fit, and the value 
of r would be misleadingly low. If a curve representing 
the real relationship could be fitted, the scatter would 
be materially reduced and the true correlation could be 
measured. The figures presented in Table 102 illustrate 
such a case. These data are plotted in Fig. 82.

Two different curves have been fitted to the points 
plotted in this figure. One is a straight line having the 
equation 

$$Y = 5.038 + .0886X$$

in which Y represents yield, in tons per acre, and X repre-
sents depth of irrigation water applied, in inches. The
degree of relationship between the two variables, as de-
scribed by this line, is indicated by the coefficient of
correlation, r, which has a value of +.69.

Table 102

Alfalfa Yield and Irrigation

Summary of investigations at Davis, California ^

(The measurements in the body of the table measure yields, in tons per acre,
in 44 experiments)

                    Inches of irrigation water applied

              0      12      18      24      30      36      48      60

            2.35    4.31    5.69    6.00    7.53    7.58    8.05    5.55
            2.75    4.78    6.46    6.89    7.97    8.22    8.45    7.25
            2.89    4.84    7.02    7.96    8.32    8.63    8.63   10.17
            3.85    5.83    8.02    8.32    9.43    9.33    8.83   10.70
            5.52    6.51            8.38    9.54    9.38    9.52
            5.94    7.52            9.96   11.06   12.48   10.62

Average
yield       3.88    5.63    6.80    7.92    8.98    9.27    9.02    8.42

Average yield, all experiments: 7.48

An inspection of the figure shows clearly that the straight 
line does not give the best possible fit. It is certain, there- 
fore, that r does not furnish a valid measure of the degree 
of relationship between alfalfa yield and depth of irrigation 
water. 
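
The straight line and its coefficient may be checked against the
44 measurements of Table 102. The sketch below is an illustration
only; the variable names are assumptions, and the results agree
with the constants of the text to within rounding (the coefficient
to within about .01).

```python
import math

# Table 102: yields (tons per acre) grouped by depth of water (inches).
YIELDS_BY_DEPTH = {
    0: [2.35, 2.75, 2.89, 3.85, 5.52, 5.94],
    12: [4.31, 4.78, 4.84, 5.83, 6.51, 7.52],
    18: [5.69, 6.46, 7.02, 8.02],
    24: [6.00, 6.89, 7.96, 8.32, 8.38, 9.96],
    30: [7.53, 7.97, 8.32, 9.43, 9.54, 11.06],
    36: [7.58, 8.22, 8.63, 9.33, 9.38, 12.48],
    48: [8.05, 8.45, 8.63, 8.83, 9.52, 10.62],
    60: [5.55, 7.25, 10.17, 10.70],
}

pairs = [(depth, y) for depth, ys in YIELDS_BY_DEPTH.items() for y in ys]
n = len(pairs)
mean_x = sum(x for x, _ in pairs) / n
mean_y = sum(y for _, y in pairs) / n
s_xx = sum((x - mean_x) ** 2 for x, _ in pairs)
s_yy = sum((y - mean_y) ** 2 for _, y in pairs)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)

slope = s_xy / s_xx                   # least squares straight line
intercept = mean_y - slope * mean_x
r = s_xy / math.sqrt(s_xx * s_yy)     # linear coefficient of correlation

# Average yield at each depth, which the straight line cannot follow.
group_means = {d: sum(ys) / len(ys) for d, ys in YIELDS_BY_DEPTH.items()}

print(f"Y = {intercept:.3f} + {slope:.4f}X,  r = {r:+.2f}")
for depth, avg in group_means.items():
    print(f"{depth:>2} inches: average yield {avg:.2f}")
```

The averages rise to the 36-inch application and decline
thereafter, a movement which no straight line can describe.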

Parabolic Relationship 

The other curve in Fig. 82 is a second degree parabola, 
fitted by the method of least squares. The equation to this 
curve is 

$$Y = 3.539 + .2527X - .002827X^2.$$

It is obvious that the effect of increasing irrigation upon 
alfalfa yield is described much more accurately by this 
latter curve than by the straight line. The most important 
result of these investigations was the determination of the 
point at which alfalfa yield began to fall off with increased 
applications of water, and the straight line fails to indicate 
any such decline. 
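
The point of maximum yield implied by the parabola may be located
by setting its slope equal to zero, which gives X = -b/(2c) for a
curve Y = a + bX + cX². The sketch below applies this to the
constants quoted in the text; the function names are assumptions,
and the depth obtained is a property of the fitted curve, not a
figure reported by the investigators.

```python
# Constants of the two equations quoted in the text.
A, B, C = 3.539, 0.2527, -0.002827


def parabolic_yield(depth):
    """Yield (tons per acre) computed from the second degree parabola."""
    return A + B * depth + C * depth ** 2


def linear_yield(depth):
    """Yield (tons per acre) computed from the straight line."""
    return 5.038 + 0.0886 * depth


# The slope B + 2*C*X vanishes at the vertex of the parabola.
peak_depth = -B / (2 * C)
peak_yield = parabolic_yield(peak_depth)

print(f"parabola reaches its maximum near {peak_depth:.1f} inches")
print(f"computed yield at that depth: {peak_yield:.2f} tons per acre")
for depth in (36, 48, 60):
    print(f"{depth} inches: parabola {parabolic_yield(depth):.2f}, "
          f"line {linear_yield(depth):.2f}")
```

Beyond this depth the parabola declines while the straight line
continues to rise, which is the failure of the linear form noted
above.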

¹ This table is taken from “The Economical Irrigation of Alfalfa in
the Sacramento Valley,” by S. H. Beckett and R. D. Robertson, Bull.
No. 280, Agricultural Experiment Station, Univ. of California, May, 1917.

As the equation of relationship, therefore, we should use
the parabolic rather than the linear form. The standard
error, $S_y$, which is a necessary accompanying measure, may
be calculated by measuring the deviation of each observed
value from the corresponding computed value, and determining
the root-mean-square of these deviations. This procedure
is illustrated in Table 103. The figures for normal yield
which are given in this table are computed from the parabolic
equation given above.

Fig. 82. Scatter Diagram Showing the Relation between Alfalfa Yield
and Irrigation Water Applied, with Two Lines of Regression

Inserting the sum of the squared deviations, as given in
col. (5) of Table 103, in the formula

$$S_y = \sqrt{\frac{\Sigma(d^2)}{N}}$$

we have

$$S_y = \sqrt{\frac{80.9945}{44}} = 1.36.$$

Table 103

Comparison of Actual and Computed Alfalfa Yield

   (1)          (2)           (3)              (4)           (5)
 Depth of     Actual      Normal yield     Deviation of    Square of
 irrigation   yield       as computed      actual from     deviation
 water                    from parabolic   normal
                          equation         (2) - (3)
    X           Y             Yc               d             d^2

    0          3.85          3.54           + .31           .0961
    0          5.94          3.54           +2.40          5.7600
    0          5.52          3.54           +1.98          3.9204
    0          2.75          3.54           - .79           .6241
    0          2.89          3.54           - .65           .4225
    0          2.35          3.54           -1.19          1.4161
   12          4.78          6.16           -1.38          1.9044
   12          7.52          6.16           +1.36          1.8496
   12          6.51          6.16           + .35           .1225
   12          4.31          6.16           -1.85          3.4225
   12          5.83          6.16           - .33           .1089
   12          4.84          6.16           -1.32          1.7424
   18          7.02          7.17           - .15           .0225
   18          5.69          7.17           -1.48          2.1904
   18          8.02          7.17           + .85           .7225
   18          6.46          7.17           - .71           .5041
   24          6.00          7.98           -1.98          3.9204
   24          8.38          7.98           + .40           .1600
   24          8.32          7.98           + .34           .1156
   24          6.89          7.98           -1.09          1.1881
   24          9.96          7.98           +1.98          3.9204
   24          7.96          7.98           - .02           .0004
   30          7.53          8.58           -1.05          1.1025
   30          9.54          8.58           + .96           .9216
   30          9.43          8.58           + .85           .7225
   30          7.97          8.58           - .61           .3721
   30         11.06          8.58           +2.48          6.1504
   30          8.32          8.58           - .26           .0676
   36          7.58          8.97           -1.39          1.9321
   36          9.33          8.97           + .36           .1296
   36          9.38          8.97           + .41           .1681
   36          8.22          8.97           - .75           .5625
   36         12.48          8.97           +3.51         12.3201
   36          8.63          8.97           - .34           .1156
   48          8.45          9.16           - .71           .5041
   48          9.52          9.16           + .36           .1296
   48          8.63          9.16           - .53           .2809
   48          8.83          9.16           - .33           .1089
   48         10.62          9.16           +1.46          2.1316
   48          8.05          9.16           -1.11          1.2321
   60         10.17          8.52           +1.65          2.7225
   60          7.25          8.52           -1.27          1.6129
   60         10.70          8.52           +2.18          4.7524
   60          5.55          8.52           -2.97          8.8209

 Total                                                    80.9945


THE INDEX OF CORRELATION 

We need now the third value, the abstract measure of
degree of relationship. In dealing with cases of linear
relationship in the preceding chapter we found that such
a measure, the coefficient of correlation, could be derived
from known values of $S_y$ and $\sigma_y$. An analogous measure
may be derived in the same way in cases of non-linear
relationship, such as that found in the present problem.
Since the term coefficient of correlation and the symbol r
refer only to cases of linear regression, we may term this
general measure the index of correlation, and use the symbol
$\rho$ (rho) to represent it.

As a general formula for the index of correlation we
have ¹

$$\rho_{yx} = \sqrt{1 - \frac{S_y^2}{\sigma_y^2}}$$

¹ With X dependent this formula becomes

$$\rho_{xy} = \sqrt{1 - \frac{S_x^2}{\sigma_x^2}}$$

The first of the two subscripts refers always to the dependent variable, the
second to the independent. It is essential that these be shown, for $\rho$ would
not necessarily be the same with X dependent as with Y dependent. Such a
distinction is not necessary in the case of linear correlation, for r is the same
no matter which variable be dependent.

The value of $S_y$ has been derived above. The value of
$\sigma_y$, computed by familiar methods, is found to be 2.27.
Substituting in the formula for $\rho$, we have

$$\rho_{yx} = \sqrt{1 - \frac{(1.36)^2}{(2.27)^2}} = .80.$$

This value is materially greater than that of the coefficient
of correlation for the same data. The value of r is +.69.
The difference is due to the fact that the second degree
parabola constitutes a much better fit to the data than
the straight line. The correlation is distinctly non-linear,
and r is an inappropriate measure of correlation.
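The arithmetic behind the standard error and the index of correlation can be sketched in a few lines. The inputs are the figures quoted in the text (the sum of squared deviations from Table 103, N = 44, and the standard deviation of 2.27); the function names are illustrative only.

```python
import math

N = 44                 # number of experiments
SUM_SQ_DEV = 80.9945   # total of col. (5), Table 103
SIGMA_Y = 2.27         # standard deviation of the yields, as stated in the text


def standard_error(sum_sq_dev, n):
    """Root-mean-square deviation about the fitted curve."""
    return math.sqrt(sum_sq_dev / n)


def index_of_correlation(s_y, sigma_y):
    """Index of correlation: sqrt(1 - S_y^2 / sigma_y^2)."""
    return math.sqrt(1 - s_y ** 2 / sigma_y ** 2)


s_y = standard_error(SUM_SQ_DEV, N)
rho = index_of_correlation(s_y, SIGMA_Y)
print(round(s_y, 2), round(rho, 2))
```

Because the index depends only on the ratio of the two measures of scatter, any curve that reduces the scatter about the fitted line raises the index.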

THE MEANING OF THE INDEX OF CORRELATION

It is important that the significance and the limitations
of $\rho$ be understood. Its value depends upon the relation
between the scatter about the fitted line and the scatter
about the arithmetic mean of the Y’s. In the case of a
straight line, $\rho$ and r are identical, r being a special case
of $\rho$. The limits of $\rho$ are 0 and 1, a value of 0 indicating
that there is no relationship, or that if there is a relationship
between the two variables it cannot be described by the
particular equation employed. A value of 1 indicates that
the relationship, as described by the equation employed,
is a perfect one. For curves of higher degree no positive or
negative sign should be attached to $\rho$, for the relationship
might be positive over part of the range and negative over
other parts, as in the alfalfa example given above.

The index of correlation, $\rho$, has no significance unless the
type of curve to which it applies be named in each case.
The meaning of r in this respect is always clear, for it is
understood that it relates always to a straight line, but
confusion would arise in the case of $\rho$ unless the type of
curve were specifically mentioned. The index of correlation
may be looked upon as a measure of the adequacy of a
curve of given type to describe the relationship between two
variables.

It is, of course, always possible to secure a curve which
will pass through any number of points if the constants
in the equation be equal to the number of points. In such
a case $\rho$ would, of necessity, be equal to 1, but this value
would have no significance. In any employment of mathematical
functions there is this limit of absurdity, when the
number of constants is equal to the number of points, and
$\rho$ would merely reflect this absurdity. The ordinary principles
of curve fitting must be kept in mind in using such
an index as this. It must never be taken to have an absolute
significance, standing by itself. Its significance is always
relative, referring to the particular function employed.
This fact, which is true of every measure of correlation,
is frequently overlooked, and invalid and fallacious conclusions
are reached as a result.

A SHORT METHOD OF COMPUTING THE INDEX OF CORRELATION

The standard error and the index of correlation were
computed by a rather laborious method in the above example,
in order that there might be no misunderstanding
of their precise meaning. The burden of calculation may
be materially reduced, however, by taking advantage of
the relationships which were disclosed in dealing with r.
For a curve of the potential series

$$Y = a + bX + cX^2 + dX^3 + \ldots$$

the formula for $S_y$ is derived by a simple extension of that
employed in the case of the straight line. As a general
formula for a series of this type, we have

$$S_y^2 = \frac{\Sigma(Y^2) - a\Sigma(Y) - b\Sigma(XY) - c\Sigma(X^2Y) - d\Sigma(X^3Y) - \ldots}{N}$$

Similarly, the formula for r may be extended to give a
general formula for $\rho$ applicable to any equation of this
general type. This formula ¹ is

$$\rho^2 = \frac{a\Sigma(Y) + b\Sigma(XY) + c\Sigma(X^2Y) + d\Sigma(X^3Y) + \ldots - Nc_y^2}{\Sigma(Y^2) - Nc_y^2}$$

In the special case in which the origin is at the mean of
the Y’s, $\Sigma(y) = 0$ and $c_y = 0$, and the formula reduces to

$$\rho^2 = \frac{b\Sigma(Xy) + c\Sigma(X^2y) + d\Sigma(X^3y) + \ldots}{\Sigma(y^2)}$$

The characteristics of the formulas for $S$ and $\rho$ should
be noted. The only values required in securing these
measures are the constants in the equation which describes
the average relationship, certain values which have been
used in the process of fitting and, in addition, $\Sigma(Y^2)$ and
$c_y^2$. Thus, as direct by-products of the fitting process, we
have the values of $S$ and $\rho$, the two measures which are
needed to supplement the regression equation in securing
a complete description of the relationship between the two
variables in question. The equation describes the average
relationship. The standard error, $S$, is a measure of the
reliability of estimates based upon this equation, and $\rho$ is
an abstract index of the degree of relationship, in so far as
that relationship can be described by the particular curve
employed.

The application of these formulas may be illustrated
with reference to the problem of alfalfa yield. The following
values, derived from the data of Table 102 and from the
fitting process, are required for this purpose:

    $a = 3.539$                    $\Sigma(X^2Y) = 407,564.64$
    $b = .252652$                  $c_y^2 = 55.9197$
    $c = -.002827$                 $\Sigma(Y^2) = 2,688.2268$
    $\Sigma(Y) = 329.03$           $N = 44$
    $\Sigma(XY) = 10,271.72$

Substituting in the formula for the standard error for a
second degree parabola,

$$S_y^2 = \frac{\Sigma(Y^2) - a\Sigma(Y) - b\Sigma(XY) - c\Sigma(X^2Y)}{N}$$

we have

$$S_y^2 = \frac{2,688.2268 - (3.539 \times 329.03) - (.252652 \times 10,271.72) - (-.002827 \times 407,564.64)}{44}$$

$$= \frac{80.8043}{44} = 1.8365$$

$$S_y = 1.36.$$

¹ See Appendix A for a discussion of the derivation of this formula.

The index of correlation, for a curve of this type, is
computed from the equation

$$\rho^2 = \frac{a\Sigma(Y) + b\Sigma(XY) + c\Sigma(X^2Y) - Nc_y^2}{\Sigma(Y^2) - Nc_y^2}$$

Substituting the appropriate values, we have

$$\rho_{yx}^2 = \frac{146.9557}{2,688.2268 - (44 \times 55.9197)} = .6452$$

$$\rho_{yx} = .80.$$
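The short method is easy to verify by direct substitution. The sketch below uses only the constants and sums listed in the text; the variable names are illustrative.

```python
import math

# Constants of the fitted parabola and the sums quoted in the text.
a, b, c = 3.539, 0.252652, -0.002827
sum_y, sum_xy, sum_x2y = 329.03, 10271.72, 407564.64
sum_y2, c_y_sq, n = 2688.2268, 55.9197, 44

# a*Sum(Y) + b*Sum(XY) + c*Sum(X^2 Y): the part of Sum(Y^2) accounted for by the curve.
explained = a * sum_y + b * sum_xy + c * sum_x2y

s_y_sq = (sum_y2 - explained) / n
rho_sq = (explained - n * c_y_sq) / (sum_y2 - n * c_y_sq)

print(round(s_y_sq, 4), round(math.sqrt(s_y_sq), 2))
print(round(rho_sq, 4), round(math.sqrt(rho_sq), 2))
```

No individual deviation is computed here; everything needed is already available once the curve has been fitted, which is the saving the short method offers.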


The value of the index of correlation is influenced by
the relation between the number of observations and the
number of constants in the equation of relationship. When
the two are equal $\rho$ will have a value of 1. In any case the
observed index of correlation tends to exceed the true index.
When the number of observations is not large it is advisable
to apply a correction for this bias. If we use $\bar{\rho}$ to represent
the corrected value and m to represent the number of
constants in the equation of relationship, we may apply
a correction in terms of the relation ¹

$$\bar{\rho}^2 = 1 - (1 - \rho^2)\left(\frac{N - 1}{N - m}\right)$$

Inserting the values given in the above example, we have

$$\bar{\rho}_{yx}^2 = 1 - \left\{(1 - .6452)\left(\frac{43}{41}\right)\right\} = .6279$$

$$\bar{\rho}_{yx} = .79.$$

If, in the application of this test, the value in brackets { }
exceeds unity, the value of $\bar{\rho}$ is taken as 0.²

¹ From Mordecai Ezekiel, Methods of Correlation Analysis, New York,
Wiley, 1930, 121.
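A small function makes the correction and its floor at zero explicit. This is a sketch of the relation described above, with `corrected_index` as a hypothetical name; it takes the squared index, the number of observations and the number of constants.

```python
import math


def corrected_index(rho_sq, n, m):
    """Corrected index of correlation for n observations and m constants.

    The bracketed quantity (1 - rho^2)(n - 1)/(n - m) is subtracted from 1;
    if it exceeds unity the corrected index is taken as 0.
    """
    bracket = (1 - rho_sq) * (n - 1) / (n - m)
    if bracket > 1:
        return 0.0
    return math.sqrt(1 - bracket)


rho_bar = corrected_index(0.6452, n=44, m=3)
print(round(rho_bar ** 2, 4), round(rho_bar, 2))
```

With 44 observations and 3 constants the adjustment is small; it grows quickly as the number of constants approaches the number of observations.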

These methods of deriving $S$ and $\rho$ are applicable over
a wide field by a simple adaptation of the formulas to the
particular equations that may be employed in given
instances. Further illustrations are given in Chapter XVII,
while this general method is explained in more detail in
Appendix A.

The Correlation Ratio

A third distinctive measure of correlation remains to
be described. This is the correlation ratio, devised by
Karl Pearson and represented by the symbol $\eta$ (eta).
This measure may be looked upon as a special case of $\rho$,
but somewhat different methods are employed in its computation.

We have seen that in all cases the degree of relationship
between two variables, as described by a curve of a given
type, may be determined from the formula

$$\text{Measure of correlation} = \sqrt{1 - \frac{S_y^2}{\sigma_y^2}}$$
The coefficient of correlation, r, is just such a measure,
when $S_y$ represents the standard deviation about a straight
line. The index of correlation, $\rho$, is a general measure of
the same type. The correlation ratio is precisely the same
sort of measure, $S_y$ in this case representing the standard
deviation about a line passing through the mean of every
column in the correlation table. We have, in effect, increased
the number of constants in the equation of the curve to
be fitted until the number is equal to the number of columns.
If the means of all the columns lie on a straight line, the
correlation ratio and the coefficient of correlation will be
equal. If the means of the columns do not lie on a straight
line, the correlation ratio will be greater than the coefficient
of correlation.

² A corresponding correction should be made in the standard error of
estimate, when derived from a small number of observations. In this case the
correction must raise the unadjusted measure. For this correction Ezekiel
gives

$$\bar{S}_y^2 = S_y^2\left(\frac{N}{N - m}\right)$$

where $\bar{S}_y$ represents the corrected standard error of estimate.

No new principle is involved, therefore, in the concept of
the correlation ratio. It is employed when the regression
is non-linear. It measures the degree of relationship between
two variables, in so far as this relationship may be
described by a curve passing through the mean of every
column. If the relationship is perfect, if there is no scatter
about the curve fitted in this way, $\eta$ will have a value of 1.
If there is no relationship, if the scatter about the curve
is as great as the dispersion about the mean of the Y’s, $\eta$
will have a value of zero.

The formula generally employed in the computation of
the correlation ratio differs somewhat from that given above.
To represent the standard deviation about the line joining
the means of the columns, the symbol $\sigma_{ay}$ is employed,
instead of $S_y$. Its meaning is precisely the same as that
of $S_y$, as employed above, except that $\sigma_{ay}$ refers always
to a correlation table.

The formula may be written

$$\eta_{yx} = \sqrt{1 - \frac{\sigma_{ay}^2}{\sigma_y^2}}$$

When eta is written as above ($\eta_{yx}$) it refers to the regression
of Y on X (Y dependent). When it is written $\eta_{xy}$ it
refers to the regression of X on Y (X dependent), and its
value depends upon the scatter about a line joining the
means of the rows. Unlike r, which has the same value
for both regressions, $\eta_{yx}$ and $\eta_{xy}$ will have different values
unless the regression be linear.

THE COMPUTATION OF THE CORRELATION RATIO

Table 104 shows the general relation between the amount
of nitrogen, in pounds per acre, used as fertilizer in certain
agricultural experiments, and the corresponding yield of
wheat, in bushels per acre.¹ The points are plotted in
Fig. 83.


Table 104

Correlation Table Showing the Relation between Wheat Yield per Acre
and Amount of Nitrogen Used as Fertilizer

                           X: Nitrogen applied, in pounds per acre
Y: Wheat yield
in bushels       0-    20-    40-    60-    80-   100-   120-   140-   160-           Mean of
per acre       19.9   39.9   59.9   79.9   99.9  119.9  139.9  159.9  179.9   Total    rows

32-35.9                               5     16     12      4      5      2      44    107.27
28-31.9                        1     20     21      8      4      1             55     88.91
24-27.9                       16     19                                         35     60.86
20-23.9                       13                                                13     50.0
16-19.9                12                                                       12     30.0
12-15.9                 8                                                        8     30.0
 8-11.9          3      5                                                        8     22.50
 4- 7.9         10                                                              10     10.0
 0- 3.9          8                                                               8     10.0

Total           21     25     30     44     37     20      8      6      2     193

Mean of
columns       5.05  15.12   24.4  28.73  31.73   32.4   32.0  33.33   34.0

¹ This table is based upon experiments described by E. Davenport
(“Comparative Agriculture” in Bailey’s Cyclopedia of American Agriculture).
The actual figures used have been arbitrarily chosen for the purpose of the
present illustration, but Davenport’s experiments have demonstrated the
existence of a law similar to the one here assumed.

For the computation of $\eta_{yx}$ by the formula given above
we need the values of $\sigma_y$ and $\sigma_{ay}$, the latter being the
root-mean-square deviation about the line joining the means
of the various columns. The former value may be obtained
readily by methods already familiar. It is possible to
compute the quantity $\sigma_{ay}$ by the method first employed
in calculating $S_y$, that is, by measuring and squaring the
deviations of the individual points from the line of regression.
In the present case, however, the line describing
the relationship passes through the mean of each column,
hence these means may be used in place of the “normal”
values as computed from an equation of regression. In
computing $\sigma_{ay}$, therefore, the deviations of the individual
items from the means of the various columns are squared,
added, the mean determined and the square root extracted,
just as in the computation of the standard deviation.
Part of the procedure is illustrated in Table 105, using the
data in the first column of Table 104. This column contains
all items having X-values between 0 and 20. The mean
Y-value of the 21 items falling in this column is 5.05;
deviations are measured from this value.

Fig. 83. Scatter Diagram Showing the Relation between Wheat Yield
and Nitrogen Applied as Fertilizer, with Straight Line of Regression and
Line Joining the Means of the Columns


Table 105

Computation of the Squares of the Deviations about the Mean of an Array

Class-interval      Mid-point   Frequency   Deviation from       d^2         fd^2
(wheat yield in                             mean of column
bu. per acre)                               (5.05)
                                    f             d

 8-11.9                10           3            4.95          24.5025      73.5075
 4- 7.9                 6          10             .95            .9025       9.0250
 0- 3.9                 2           8           -3.05           9.3025      74.4200

Total                                                                      156.9525


The sum of the squared deviations is obtained for each
of the other columns in a similar fashion. The standard
deviation about the means of all the columns, $\sigma_{ay}$, is found
to have a value of 2.420. The value of $\sigma_y$ is 9.188.

Substituting the given values in the formula

$$\eta_{yx}^2 = 1 - \frac{\sigma_{ay}^2}{\sigma_y^2}$$

we have

$$\eta_{yx}^2 = 1 - \frac{(2.42)^2}{(9.188)^2} = 1 - .0694 = .9306$$

$$\eta_{yx} = .965.$$

This is the value of the correlation ratio, measuring the
degree of scatter about a line running through the means
of the columns. Its significance is discussed below.
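The long method can be sketched for the one column worked out in Table 105, followed by the final substitution. The class mid-points and frequencies are those of the first column of Table 104; the names used are illustrative.

```python
import math

COLUMN_MEAN = 5.05  # mean yield of the 21 items in the first column

# (mid-point of yield class, frequency) for the first column, as in Table 105.
FIRST_COLUMN = [(10, 3), (6, 10), (2, 8)]

# Sum of f * d^2, where d is the deviation of the class mid-point from the column mean.
sum_sq = sum(f * (mid - COLUMN_MEAN) ** 2 for mid, f in FIRST_COLUMN)


def correlation_ratio(sigma_ay, sigma_y):
    """Correlation ratio from the scatter about the column means."""
    return math.sqrt(1 - sigma_ay ** 2 / sigma_y ** 2)


eta = correlation_ratio(2.420, 9.188)
print(round(sum_sq, 4), round(eta, 3))
```

Repeating the first step for every column, adding the results, dividing by the total number of items and taking the square root would give the value 2.420 used in the second step.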

The method of calculation employed in the preceding
example may be materially shortened. Let $\sigma_{my}$ represent
the standard deviation of the means of the various columns
about the arithmetic mean of all the Y’s. In computing
this value the mean of each column is weighted by the
number of items in that column. It may be shown ¹ that

$$\sigma_y^2 = \sigma_{ay}^2 + \sigma_{my}^2.$$

Substituting for $\sigma_{ay}^2$ in the equation

$$\eta_{yx}^2 = 1 - \frac{\sigma_{ay}^2}{\sigma_y^2}$$

we secure

$$\eta_{yx} = \frac{\sigma_{my}}{\sigma_y}.$$

Since $\sigma_{my}$ may be much more easily determined than $\sigma_{ay}$
the value of $\eta$ is generally computed from this formula.
The data of Table 104 may be used to exemplify the process.
Calculations appear in Table 106.

¹ The following proof is adapted from Yule.

Given a series with mean $M$ made up of two component series with means $M_1$
and $M_2$. $N$, the total number of observations, is equal to $N_1 + N_2$, the sum of
the observations in the two component series. What is the relation between $\sigma$,
$\sigma_1$ and $\sigma_2$? If we let $M_1 - M = c_1$ and $M_2 - M = c_2$,
then for $s_1^2$, the mean-square deviation of the observations in the first of the
two component series, measured from $M$ as origin, we have

$$s_1^2 = \sigma_1^2 + c_1^2.$$

Similarly $s_2^2 = \sigma_2^2 + c_2^2$.

But $N_1 s_1^2$ is equal to the sum of the squares of the deviations, about $M$, of
the items in the first of the component series, and $N_2 s_2^2$ is equal to the sum
of the squares of the deviations, about $M$, of the items in the second of the
two component series. Therefore

$$N\sigma^2 = N_1 s_1^2 + N_2 s_2^2. \qquad (1)$$

But $s_1^2 = \sigma_1^2 + c_1^2$ and $s_2^2 = \sigma_2^2 + c_2^2$; therefore

$$N\sigma^2 = N_1(\sigma_1^2 + c_1^2) + N_2(\sigma_2^2 + c_2^2). \qquad (2)$$

In the present case we have the major series with mean represented by $M_y$,
and a number of component series (the items arranged by columns) with means
represented by $m_{y1}$, etc. Let $s_{ay}$ represent the standard deviation of any column
about the mean of that column. Then we have a number of component
series, with standard deviations $s_{ay1}$, etc., and with means differing from the
mean of all the Y’s by $M_y - m_{y1}$, etc. Substituting in equation (2), we have

$$N\sigma_y^2 = n_1[s_{ay1}^2 + (M_y - m_{y1})^2] + n_2[s_{ay2}^2 + (M_y - m_{y2})^2] + \ldots \qquad (3)$$

$$= \Sigma(n s_{ay}^2) + \Sigma n(M_y - m_y)^2. \qquad (4)$$

But $n s_{ay}^2 = \Sigma(d^2)$ for, in each column,

$$s_{ay}^2 = \frac{\Sigma(d^2)}{n}$$

since d represents a deviation from the mean of that column. For all columns,
$\Sigma(n s_{ay}^2) = N\sigma_{ay}^2$. Substituting in equation (4)

$$N\sigma_y^2 = N\sigma_{ay}^2 + \Sigma n(M_y - m_y)^2. \qquad (5)$$

But $\Sigma n(M_y - m_y)^2 = N\sigma_{my}^2$, by definition of the standard deviation of the
means of the columns. Hence $\sigma_y^2 = \sigma_{ay}^2 + \sigma_{my}^2$.


Table 106

Illustrating the Computation of the Correlation Ratio

Type of array    Mean value    Deviation      Square of    Frequency
(X-value of      of Y-items    from mean      deviation
column)          in column     of all Y’s
(pounds)         (bushels)     (25.005)
                    m_y            d             d^2           f           fd^2

    10              5.05       -19.955        398.202         21        8,362.242
    30             15.12       - 9.885         97.713         25        2,442.825
    50             24.40       -  .605           .366         30           10.980
    70             28.73       + 3.725         13.876         44          610.544
    90             31.73       + 6.725         45.226         37        1,673.362
   110             32.40       + 7.395         54.686         20        1,093.720
   130             32.00       + 6.995         48.930          8          391.440
   150             33.33       + 8.325         69.306          6          415.836
   170             34.00       + 8.995         80.910          2          161.820

 Total                                                       193       15,162.769

$$\sigma_{my} = \sqrt{\frac{15,162.769}{193}} = 8.864.$$

Substituting the given values in the formula

$$\eta_{yx} = \frac{\sigma_{my}}{\sigma_y}$$

we have

$$\eta_{yx} = \frac{8.864}{9.188} = .965.$$

The process of computing the correlation ratio may be
briefly summarized:

1. Arrange the items in the form of a correlation table.

2. Find the arithmetic mean of all the Y-items in each column
(i.e., find the arithmetic mean of each Y-array of type X).

3. Compute the arithmetic mean of all the Y’s.

4. Measure the deviation of the mean of each column from the
mean of all the Y’s. Square each of these deviations and
multiply by the number of items in the given column. Get the
sum of the squared deviations.

5. Divide this sum by the total number of items and extract the
square root of the result. This gives the value of $\sigma_{my}$.

6. Compute $\sigma_y$.

7. Divide $\sigma_{my}$ by $\sigma_y$. The quotient is $\eta_{yx}$.
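The steps above can be sketched as a single function applied to the frequencies of Table 104. This is an illustration under stated assumptions: each class is represented by its mid-point (34, 30, ... 2 bushels), and the entry of 8 in the 28-31.9 row under 100-119.9 pounds is inferred from the row and column totals. The function name and return layout are hypothetical. The sketch also checks the relation between the three standard deviations on which the short method rests.

```python
import math

# Mid-points of the wheat-yield classes, highest class first.
Y_MID = [34, 30, 26, 22, 18, 14, 10, 6, 2]

# FREQ[row][column]: rows follow Y_MID, columns run from 0-19.9 up to 160-179.9 pounds.
# The 8 in the second row is inferred from the printed totals (an assumption).
FREQ = [
    [0, 0, 0, 5, 16, 12, 4, 5, 2],
    [0, 0, 1, 20, 21, 8, 4, 1, 0],
    [0, 0, 16, 19, 0, 0, 0, 0, 0],
    [0, 0, 13, 0, 0, 0, 0, 0, 0],
    [0, 12, 0, 0, 0, 0, 0, 0, 0],
    [0, 8, 0, 0, 0, 0, 0, 0, 0],
    [3, 5, 0, 0, 0, 0, 0, 0, 0],
    [10, 0, 0, 0, 0, 0, 0, 0, 0],
    [8, 0, 0, 0, 0, 0, 0, 0, 0],
]


def correlation_ratio(freq, y_mid):
    """Return (eta_yx, sigma_my, sigma_y, sigma_ay) for a correlation table."""
    n_total = sum(sum(row) for row in freq)
    # Step 3: arithmetic mean of all the Y's.
    mean_all = sum(f * y for row, y in zip(freq, y_mid) for f in row) / n_total
    between_ss = within_ss = 0.0
    for column in zip(*freq):  # step 1: the table is read column by column
        n_col = sum(column)
        mean_col = sum(f * y for f, y in zip(column, y_mid)) / n_col  # step 2
        between_ss += n_col * (mean_col - mean_all) ** 2              # step 4
        within_ss += sum(f * (y - mean_col) ** 2 for f, y in zip(column, y_mid))
    sigma_my = math.sqrt(between_ss / n_total)                        # step 5
    total_ss = sum(f * (y - mean_all) ** 2 for row, y in zip(freq, y_mid) for f in row)
    sigma_y = math.sqrt(total_ss / n_total)                           # step 6
    sigma_ay = math.sqrt(within_ss / n_total)
    return sigma_my / sigma_y, sigma_my, sigma_y, sigma_ay            # step 7


eta, sigma_my, sigma_y, sigma_ay = correlation_ratio(FREQ, Y_MID)
print(round(eta, 3), round(sigma_my, 3), round(sigma_y, 3), round(sigma_ay, 3))
```

Small differences from the printed figures in the third decimal place are to be expected, since the table rounds the column means before the deviations are taken.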

The value of the correlation ratio of X on Y may be
similarly computed, substituting the proper values in the
formula

$$\eta_{xy} = \frac{\sigma_{mx}}{\sigma_x}$$

The symbol $\sigma_{mx}$ represents the standard deviation of the
means of the various rows about the mean of all the X’s.
The value of the correlation ratio of X on Y depends upon
the amount of scatter (horizontally) about the line joining
the means of the rows. Its value will generally be different
from that of the correlation ratio of Y on X. In the present
case the value of $\eta_{xy}$ is found to be .824. As the line of
relationship approaches the linear form the two correlation
ratios approach identity.

Like r, $\eta$ can never exceed 1, this value being secured
when there is no dispersion about the line joining the
means of the columns (or rows). From the formula

$$\eta_{yx} = \frac{\sigma_{my}}{\sigma_y}$$

it is evident that the value of the correlation ratio is zero
when $\sigma_{my}$ is zero. This is the case when the mean of each
column has the same value as the mean of all the Y’s.
Such a condition is found when an increase or decrease in
the value of the X-variable brings no corresponding change
in the value of the Y-variable. This means that in each
column of the correlation table there is a distribution of
cases similar to the general distribution of Y’s. When
this is true there is clearly no relation between the two
variables.

The correlation ratio, it should be noted, never has a
negative value. It is possible to determine by inspection
of the correlation table, however, whether the relation
between two variables is direct, or inverse, or a varying one.

The coefficient of correlation has one distinct advantage,
as compared with the correlation ratio, in that when its
value and the values of the two standard deviations are
known the equations to the lines of regression may be
readily determined. This is not true of $\eta$. To get a quantitative
expression for the “law” of relationship between two
variables, when $\eta$ has been computed, an additional calculation
for the purpose of fitting a curve to the means of the
arrays would be necessary.

CORRECTION OF THE CORRELATION RATIO

The use of $\eta$ is only possible when the data are numerous,
and can be arranged in the form of a correlation table. If
a limited number of items should be so arranged, and it
chanced that there was but one item in each column, the
two measures $\sigma_{my}$ and $\sigma_y$ would be identical and $\eta$ would
necessarily have a value of 1. Computed from a very small
number of cases and employing a large number of classes,
the correlation ratio would be meaningless.

The raw correlation ratio may be corrected by the method
employed on a preceding page for the index of correlation,
with m set equal to the number of groups (i.e., to the number
of columns, for $\eta_{yx}$; to the number of rows for $\eta_{xy}$).
Thus, if $\bar{\eta}$ be the corrected value, we have

$$\bar{\eta}^2 = 1 - (1 - \eta^2)\left(\frac{N - 1}{N - m}\right)$$

In the present instance

$$\bar{\eta}_{yx}^2 = 1 - (1 - .9306)\left(\frac{193 - 1}{193 - 9}\right) = .9276$$

$$\bar{\eta}_{yx} = .963.$$

The correction is very slight in the present case, but if
N were small or m very large it would reduce the given
value materially.


RELATION BETWEEN THE CORRELATION RATIO AND THE
COEFFICIENT OF CORRELATION

When the relation between two variables is absolutely
linear the line running through the means of the columns
corresponds, of course, to the line upon which the coefficient
of correlation is based. When this is the case $\eta$ and r have
the same value. As the relationship between the two
variables departs from the linear form the values secured
for $\eta$ and r differ, $\eta$ being always greater than r. This
results from the fact that the scatter about a line joining
the means of the columns will always be less than the
scatter about a straight line fitted to these points, except
when the straight line passes through every mean point.
And the less the scatter about the line expressing the
average relationship the greater the value of the measure
of correlation. Thus for the alfalfa problem it was found
that r has a value of +.69, and that an index of correlation
based upon a second degree parabola has a value of .80.
The correlation ratio for the same material is .82. For
the data of Table 104 the value of $\eta_{yx}$ (uncorrected) was
found to be .965; the value of r is +.793, the difference
between the two being marked. The reason for the difference
is found in Fig. 83, in which the straight line of regression
of Y on X and the line joining the means of the columns
are shown. The regression departs materially from linearity,
and the scatter about the straight line of regression is much
greater than the scatter about the line joining the means.

The relation between r and $\eta$ affords a convenient test
of linearity in a given instance, since the two values will
be identical when the regression is strictly linear, and will
differ the more widely the greater the departure from the
linear form. The general test for linearity is

$$\zeta = \eta^2 - r^2.$$

Even in a case of linear regression it is probable that $\eta$
and r will differ somewhat because of fluctuations due to
chance alone. A material difference, as reflected in the
magnitude $\zeta$ (zeta), indicates that a straight line does not
describe the relationship in question and that r is not a
suitable measure of correlation. In the example given
above, in which $\eta$ equals .965 and r equals .793, the
measure $\zeta$ has a value of .302. (The uncorrected $\eta$ is used
in this test.) This is large enough to indicate that the regression
is non-linear.
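The test quantity is a simple difference of squares, as this short sketch shows. The function name is illustrative, and the inputs are the uncorrected correlation ratio and the coefficient of correlation quoted in the text.

```python
def linearity_zeta(eta, r):
    """Zeta = eta^2 - r^2; values near zero are consistent with linear regression."""
    return eta ** 2 - r ** 2


zeta = linearity_zeta(eta=0.965, r=0.793)
print(round(zeta, 3))
```

When the means of the columns lie on a straight line the two inputs coincide and the measure is zero.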

In later sections methods of testing for linearity are
more fully discussed.

CHAPTER XIII


ELEMENTARY PROBABILITIES AND THE 
NORMAL CURVE OF ERROR 

Reference has been made in an earlier section to the 
family resemblance which is found among frequency distri- 
butions drawn from widely different fields. Attention was 
also drawn to a certain basic type, represented graphi- 
cally by the symmetrical bell-shaped curve, which is called 
the “normal curve,” or the “normal curve of error.” In 
an earlier day this curve was looked upon as representing 
a fundamental law which described all distributions of 
quantitative data. From the modern standpoint this was 
quite an erroneous conception. The normal curve is viewed 
today as but one of a number of types of curves which
may be used to describe frequency distributions. It is,
however, by far the most important type. For many of
the measurements used to describe distributions of observations
(measurements such as the mean, the standard deviation,
the coefficient of variation) are distributed in accordance
with this normal law of error. The procedures employed
in generalizing results obtained from the study of samples 
and, in particular, in determining the reliability of such 
generalizations, lean heavily upon this law. An under- 
standing of the characteristics of the normal curve is 
essential to the statistician. 

Elementary Theorems in Probability 

We may approach this subject by a brief consideration
of certain elementary principles of probability that enter
into many forms of statistical work. A detailed explanation
of the theory of probability would carry us beyond the
limits of the present volume. The treatment which follows
is presented only as an introduction to the subject, designed
to illustrate, by simple numerical examples, the relation
between the principles of probability and the normal law
of error.

In this argument we may use the following standard
notation. If an event can occur in n ways, a of which are
to be considered as successful and b as unsuccessful, the
probability p of a successful outcome may be written

$$p = \frac{a}{n}$$

and the probability q of an unsuccessful outcome may be
written

$$q = \frac{b}{n}.$$

Since the sum of the favorable and unfavorable outcomes
is equal to the total number of events, we have

$$a + b = n.$$

Dividing by n,

$$\frac{a}{n} + \frac{b}{n} = 1$$

so that

$$p + q = 1$$

or certainty.

A probability, therefore, may be written as a ratio. The 
numerator of the fraction corresponding to this ratio repre- 
sents the number of favorable (or unfavorable) outcomes, 
while the denominator represents the total number of 
possible outcomes. 


EXAMPLES OF SIMPLE PROBABILITIES

If a coin be tossed, the turning up of a head being looked
upon as a favorable outcome, we have, as the probability
of a success,

$$p = \frac{1}{2}$$

and of a failure,

$$q = \frac{1}{2}.$$

If we roll a die, regarding a six spot as a favorable outcome,

$$p = \frac{1}{6}, \qquad q = \frac{5}{6}.$$

If a card be drawn from a pack of 52 the chance of drawing
the ace of spades is 1/52, of failing in that endeavor, 51/52.


THE ADDITION OF PROBABILITIES

What is the chance of securing either an ace of spades
or a two of spades in a single draw from a pack of 52 cards?
In such a case, where any one of several outcomes will be
considered as favorable, the probability of a success is
the sum of the separate probabilities. In this example

$$p = \frac{1}{52} + \frac{1}{52} = \frac{1}{26}.$$

The chance of drawing either a heart or a spade from a
pack of playing cards is given by

$$p = \frac{13}{52} + \frac{13}{52} = \frac{1}{2}.$$

THE MULTIPLICATION OP PROBABILITIES 

Two events are said to be independent when the outcome 
of one does not affect the outcome of the other. Thus the 
result of one throw of a die does not, presumably, affect 
the result of the next toss. The probability of a compound
event (i.e., that two events, independent of one another,
will both occur) is the product of the probabilities of the
separate events. Thus the chance of securing an ace,
followed by a two spot, in two successive throws of a die,
is given by

$$p = \frac{1}{6} \times \frac{1}{6} = \frac{1}{36}.$$

In computing the probability of a given outcome it is 
frequently necessary both to multiply and to add probabili- 
ties. For example, we wish to determine the chance of 
securing the total 5 from two dice thrown simultaneously. 
We may label the dice a and b to distinguish them. This 
total may be secured from any one of the four following 
combinations:

Die a    Die b

  1        4
  2        3
  3        2
  4        1

The chance of securing an ace with die a is 1/6, of securing
a 4 with die b is 1/6. The chance of the two in combination
is 1/36. Similarly, the probability of each of the other
three combinations is 1/36. But any one of these four
results will give a total of 5, and will be considered
successful. Hence

$$p = \frac{1}{36} + \frac{1}{36} + \frac{1}{36} + \frac{1}{36} = \frac{1}{9}.$$

We have in this example answered the question: What 
is the probability of securing exactly 5 in the toss of two 
dice? We might put the question: What is the chance of
securing at least 5 in the toss of two dice? In this case a
total of 5 or more will be considered a favorable outcome.
Just as in the preceding example, we may work out the
probability of securing each of the results which will be
accepted as successful. The following summary indicates
the probability of each of these totals:

Total      Probability

  5           4/36
  6           5/36
  7           6/36
  8           5/36
  9           4/36
 10           3/36
 11           2/36
 12           1/36

Sum of above probabilities = 30/36

The chance of throwing at least 5 in the toss of two dice is,
therefore, 30/36 or 5/6.
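
The two dice results above may be checked by direct enumeration. The following is a minimal sketch in Python, not part of the original argument; the function name `probability_of_total` is an illustrative assumption. It lists the 36 equally likely pairs and counts those whose total is accepted as a success.

```python
from fractions import Fraction
from itertools import product

def probability_of_total(predicate, sides=6):
    """Probability that the total of two fair dice satisfies `predicate`."""
    # Every ordered pair (die a, die b) is equally likely: 36 pairs in all.
    outcomes = list(product(range(1, sides + 1), repeat=2))
    favorable = [pair for pair in outcomes if predicate(sum(pair))]
    return Fraction(len(favorable), len(outcomes))

exactly_five = probability_of_total(lambda total: total == 5)
at_least_five = probability_of_total(lambda total: total >= 5)

print(exactly_five)    # 1/9
print(at_least_five)   # 5/6
```

Counting pairs in this way applies the multiplication rule (each pair has probability 1/36) and the addition rule (the favorable pairs are summed) in a single step.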

The Binomial Expansion and the Measurement of 
Probabilities 

It is possible to express these facts in a generalized form. 
A simple illustration may be employed to exemplify the 
derivation of the general expression. 

If two coins are tossed simultaneously there are four 
possible outcomes 

a b   a b   a b   a b
T T   T H   H T   H H.

(The two coins are represented, respectively, by the letters
a and b.) The chances of securing no heads, one head, and
two heads are, respectively, 1/4, 1/2, and 1/4. If three coins
(represented by the letters a, b, and c) are tossed
simultaneously, we have eight possible outcomes

a b c   a b c   a b c   a b c   a b c   a b c   a b c   a b c
T T T   T T H   T H H   T H T   H T T   H T H   H H T   H H H.

The chances of securing no heads, 1 head, 2 heads, and
3 heads are, respectively, 1/8, 3/8, 3/8, 1/8.

But these results may be derived without working out 
the separate probabilities in detail. We have employed 
p and q to represent, respectively, the probability of success 
and failure of a given event. If there are two independent 
events the compound probabilities are given by the expansion 
of the expression 

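$$(p + q)^2.$$

For the case in which p (e.g., the probability of throwing
a head) = q = 1/2, the probabilities of the various results
are given by

$$\left(\tfrac{1}{2} + \tfrac{1}{2}\right)^2 = \tfrac{1}{4} + \tfrac{1}{2} + \tfrac{1}{4}.$$

These are the results secured in the first example cited
in this section. If there are three independent events,
with p = q = 1/2, we have

$$\left(\tfrac{1}{2} + \tfrac{1}{2}\right)^3 = \tfrac{1}{8} + \tfrac{3}{8} + \tfrac{3}{8} + \tfrac{1}{8},$$

the probabilities secured in the second example.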

If we wish to know not the separate probabilities but the 
probable frequencies of the various outcomes in a given 
number of trials, these may be computed from the expression 

$$N(p + q)^n$$

where N represents the number of trials and n the number
of independent events. Thus if there are 200 trials and
there are two independent events, the probable frequencies
are given by

$$200(p + q)^2 = 200(p^2 + 2pq + q^2).$$

With p = q = 1/2 this gives us

$$200\left(\tfrac{1}{4}\right) + 200\left(\tfrac{1}{2}\right) + 200\left(\tfrac{1}{4}\right) = 50 + 100 + 50,$$

which indicates the probable frequencies of 2 successes,
1 success, and no successes.

If there are three independent events, the probable 
frequencies in N trials are determined from the binomial 
expansion of 

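$$N(p + q)^3.$$

If N equals 200, we have

$$200(p^3 + 3p^2q + 3pq^2 + q^3).$$

If p equals 1/2, we have

$$200\left(\tfrac{1}{8}\right) + 200\left(\tfrac{3}{8}\right) + 200\left(\tfrac{3}{8}\right) + 200\left(\tfrac{1}{8}\right).$$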

These terms indicate, in order, the probable frequencies 
of 3 successes, 2 successes, 1 success, and no successes. 
The total frequencies secured by carrying through the 
process of multiplication will be equal to the number of 
trials, for all possible outcomes are covered by the expansion. 

Thus, when we know in advance¹ the probabilities attaching
to similar but independent events, we may determine the
probable frequencies of any given number of successes or 
failures. This is true whether p and q be equal or unequal. 
It is necessary only that p and q remain constant. There 
is here a fact of great significance in the development of 
statistical theory. 
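
The probable frequencies given by N(p + q)^n may be computed term by term. The sketch below is an illustration only; the function name `probable_frequencies` and the unequal case (p = 1/6 with 216 trials) are assumptions introduced here and do not appear in the text.

```python
from fractions import Fraction
from math import comb

def probable_frequencies(trials, n, p):
    """Terms of N(p + q)^n, ordered from n successes down to no successes."""
    q = 1 - p
    return [trials * comb(n, k) * p**k * q**(n - k) for k in range(n, -1, -1)]

half = Fraction(1, 2)
two_events = probable_frequencies(200, 2, half)      # 2, 1, 0 successes
three_events = probable_frequencies(200, 3, half)    # 3, 2, 1, 0 successes

# Hypothetical unequal case: a six spot counted as a success on each of 3 dice.
unequal = probable_frequencies(216, 3, Fraction(1, 6))

print([int(f) for f in two_events])
print([int(f) for f in three_events])
print([int(f) for f in unequal])
```

In each case the terms sum to the number of trials, since the expansion covers every possible outcome. Exact fractions are used so that no rounding enters the comparison.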

A COMPARISON OF ACTUAL AND THEORETICAL FREQUENCIES
IN THE REALM OF PURE CHANCE 

Certain points of importance may be made clear by 
comparing some experimental results with the theoretical 
frequencies given by the binomial expansion. Twelve dice
were thrown a number of times. Each 4, 5, or 6 spot
appearing was considered to be a success, while a 1, 2,
or 3 spot was a failure. (In a typical throw we might have
the following spots up: 3, 1, 5, 1, 2, 4, 4, 6, 3, 2, 3, 5. In
this lot there are five successes, and the result is so tallied.)
In a classical example recorded by W. F. R. Weldon*
twelve dice were thrown in this way 4,096 times, a success
being defined as above. The results are recorded in column
(2) of Table 107, and the distribution is shown in Fig. 84.
By computation we find the arithmetic mean and the
standard deviation of this distribution to be, respectively,
6.139 and 1.712.

¹ A distinction is generally drawn between a priori probabilities of the type
described above, and empirical probabilities, knowledge of which is derived
from observation or experience. As an example of the latter type we have,
as the probability that a man aged 35 will live 10 years, the ratio
74,173/81,822. This is based upon the American Experience Table of
Mortality, which shows that of 81,822 men living at the age of 35, there
are 74,173 living ten years later.

Let us compare with these results those which we might 
expect from the given conditions. Twelve dice were thrown 
each time, hence we are dealing with 12 independent 
events. There were 4,096 trials. Since either a 4, 5, or 6 is 
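considered a success, p = q = 1/2.

For the terms in the binomial expansion we have

$$(p + q)^n = p^n + np^{n-1}q + \frac{n(n-1)}{1 \cdot 2}\,p^{n-2}q^2 + \cdots + q^n.$$

In the present case we have

$$4{,}096\left(\tfrac{1}{2} + \tfrac{1}{2}\right)^{12}.$$

Expanding,

$$4{,}096\left(\frac{1}{4{,}096} + \frac{12}{4{,}096} + \frac{66}{4{,}096} + \frac{220}{4{,}096} + \cdots + \frac{12}{4{,}096} + \frac{1}{4{,}096}\right).$$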

Completing the indicated multiplication we have the theo- 
retical frequencies of the various possible successes in 
4,096 throws of twelve dice. These are shown in column 
(3) of Table 107. 

* Cited by F. Y. Edgeworth, Encycl. Brit., 11th ed., Vol. XXII, 394.

Table 107

Comparison of Actual and Theoretical Frequencies in Dice-Rolling
Experiment

    (1)            (2)             (3)
 Number of      Observed       Theoretical
 successes     frequencies     frequencies

     0               0               1
     1               7              12
     2              60              66
     3             198             220
     4             430             495
     5             731             792
     6             948             924
     7             847             792
     8             536             495
     9             257             220
    10              71              66
    11              11              12
    12               0               1

  Total          4,096           4,096


The distribution of the theoretical frequencies is shown 
in Fig. 84, with that of the observed frequencies. The 
relationship of the two distributions is close. 

When we have, as in this case, a knowledge of the 
probabilities involved, it is possible to determine the arith- 
metic mean and the standard deviation of the distribution 
of the theoretical frequencies. As a general expression for 
the mean number of successes, where the number of
independent events and the probability of success are known,
we have

$$M = np.$$

Applying the present values,

$$M = 12 \times \tfrac{1}{2} = 6.$$

The mean, as computed from the observed frequencies, is
6.139.

As a general expression for the standard deviation,¹ under
the same conditions, we have

$$\sigma = \sqrt{npq}.$$

In the present case

$$\sigma = \sqrt{12 \times \tfrac{1}{2} \times \tfrac{1}{2}} = \sqrt{3} = 1.732.$$

Fig. 84. A Comparison of Actual and Theoretical Frequencies in a
Dice-Rolling Experiment (horizontal axis: number of successes, 0 to 12;
curves: theoretical frequency and actual frequency)

The standard deviation, as computed from the actual 
frequencies, is 1.712. 
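
The figures of Table 107 and the two summary measures may be verified as follows. This is a sketch for checking the arithmetic, with the observed frequencies copied from column (2) of the table; the helper name `mean_and_sd` is an assumption of this illustration.

```python
from math import comb, sqrt

# Column (2) of Table 107: observed frequencies for 0, 1, ..., 12 successes.
observed = [0, 7, 60, 198, 430, 731, 948, 847, 536, 257, 71, 11, 0]
n, p = 12, 0.5
q = 1 - p
trials = sum(observed)   # 4,096 throws of twelve dice

# Column (3): terms of 4,096 (p + q)^12.
theoretical = [trials * comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]

def mean_and_sd(freqs):
    """Arithmetic mean and standard deviation of the number of successes."""
    total = sum(freqs)
    mean = sum(k * f for k, f in enumerate(freqs)) / total
    variance = sum(f * (k - mean) ** 2 for k, f in enumerate(freqs)) / total
    return mean, sqrt(variance)

obs_mean, obs_sd = mean_and_sd(observed)
theo_mean, theo_sd = mean_and_sd(theoretical)

print(round(obs_mean, 3), round(obs_sd, 3))     # observed mean and s.d.
print(theo_mean, round(theo_sd, 3))             # from the binomial terms
print(n * p, round(sqrt(n * p * q), 3))         # from M = np and sqrt(npq)
```

The mean and standard deviation computed from the theoretical frequencies agree with the values given directly by M = np and by the square root of npq, which is the point of the two general expressions.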

When proportions, or relative frequencies, are dealt with,
the standard deviation (σ') may be derived from the relation

$$\sigma' = \sqrt{\frac{pq}{n}}.$$

¹ This formula for the standard deviation of a binomial distribution is of
central importance. The derivation of this formula, and that for the mean of
a binomial distribution, are given in Appendix B.

The Normal Curve of Error

We may return to a consideration of the curve in Fig. 84 
which represents the theoretical frequencies in the
dice-throwing experiments. It is a perfectly symmetrical 12-sided
polygon, the number of sides (excluding the base)
corresponding to the number of independent events in the
particular problem considered. With six events we should have a
six-sided figure, with twenty events a twenty-sided figure,
and so on. It is obvious that, as n increases, the number
of sides to the polygon increasing correspondingly in
number, the graph representing the expansion of the binomial
(p + q)^n approaches more and more closely a smooth curve.
With n infinitely large a perfectly smooth curve would be 
secured. This is the normal curve of error which has been 
plotted in Fig. 85. 

The equation to this curve is written in several forms, of
which

$$y = y_0\, e^{-\frac{x^2}{2\sigma^2}}$$

is one. In this equation y₀, the maximum ordinate, is a
constant; e is a constant (the base of the Napierian
logarithms) having a value of 2.71828; σ represents the
standard deviation; and x is a given value of the independent
variable expressed as a deviation from the mean. The
maximum ordinate may be derived from the relation

$$y_0 = \frac{N}{\sigma\sqrt{2\pi}},$$

hence the equation to the normal curve may be written

$$y = \frac{N}{\sigma\sqrt{2\pi}}\, e^{-\frac{x^2}{2\sigma^2}},$$

where π is the constant 3.14159.
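
The approach of the binomial polygon to the smooth curve may be seen numerically. The sketch below, offered as an illustration under the conditions of the dice experiment (N = 4,096, n = 12, p = 1/2), sets each term of the binomial expansion beside the ordinate of the normal curve having the same mean and standard deviation. The function name `normal_ordinate` is an assumption of this example.

```python
from math import comb, exp, pi, sqrt

def normal_ordinate(x, total, sigma):
    """Ordinate y = N / (sigma * sqrt(2 * pi)) * exp(-x^2 / (2 * sigma^2))."""
    return total / (sigma * sqrt(2 * pi)) * exp(-x**2 / (2 * sigma**2))

n, p, total = 12, 0.5, 4096
mean = n * p                        # M = np
sigma = sqrt(n * p * (1 - p))       # sigma = sqrt(npq)

for successes in range(n + 1):
    binomial = total * comb(n, successes) * p**n    # p = q, so p^k q^(n-k) = p^n
    normal = normal_ordinate(successes - mean, total, sigma)
    print(f"{successes:2d}  {binomial:7.1f}  {normal:7.1f}")
```

With only twelve events the two columns already lie close together, the maximum ordinate of the curve being somewhat above the central binomial term of 924.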

This equation may be derived in several ways.* One
procedure, which throws light on the physical conditions
giving rise to the emergence of a normal distribution, starts
from three basic assumptions.

1. The causal forces affecting individual events are
numerous, and of approximately equal weight.

2. The causal forces affecting individual events are
independent of one another.

3. The operation of the causal forces is such that
deviations above the mean of the combined results are balanced
as to magnitude and number by deviations below the mean.

* Gauss' deduction of the error equation may be found in all standard works
on the theory of least squares. Cf. references at end of Appendix A.

A great part of the power which modern statistical 
technique possesses is derived from the detailed knowledge 
of the characteristics of the normal or Gaussian curve. 
From prepared tables showing the fractional parts of the 
total area under the curve lying between ordinates erected 
at stated distances from the maximum ordinate, theoret- 
ical frequencies may be determined much more readily 
than by the laborious method based upon the binomial 
expansion. 

USE OF A TABLE OF AREAS UNDER THE NORMAL CURVE

The entire area under a frequency curve is taken to 
represent the total number of frequencies. Given information 
as to the proportion of the total area within a given segment, 
it would be easy to compute the frequencies represented 
by this segment, or to determine the probability that a 
given observation from the population represented by the 
curve would fall within the limits of this segment. Prepared 
tables of the probability integral, of which Table 108 is an 
example, serve just this purpose, with respect to the normal 
curve. (A more detailed table than that here given is
needed, for accurate computation. Appendix Table I will
serve most purposes.¹)

¹ Tables of areas under the normal curve, as calculated by Dr. W. F.
Sheppard, are available in many publications. Cf. Tables for Statisticians
and Biometricians, edited by Karl Pearson, Biometric Laboratory,
University College,

Table 108

Area of the Normal Curve in Terms of Abscissa
(Giving fractional parts of the total area between y₀ and ordinates erected
at varying distances from y₀)

   x/σ          a            x/σ          a

   0.0       .00000          2.0       .47725
   0.1       .03983          2.1       .48214
   0.2       .07926          2.2       .48610
   0.3       .11791          2.3       .48928
   0.4       .15542          2.4       .49180
   0.5       .19146          2.5       .49379
                             2.5758    .49500
   0.6       .22575          2.6       .49534
   0.7       .25804          2.7       .49653
   0.8       .28814          2.8       .49744
   0.9       .31594          2.9       .49813
   1.0       .34134          3.0       .49865
   1.1       .36433          3.1       .49903
   1.2       .38493          3.2       .49931
   1.3       .40320          3.3       .49952
   1.4       .41924          3.4       .49966
   1.5       .43319          3.5       .49977
   1.6       .44520          3.6       .49984
   1.7       .45543          3.7       .49989
   1.8       .46407          3.8       .49993
   1.9       .47128          3.9       .49995
   1.96      .47500          4.0       .49997


Since the normal curve is symmetrical about the maxi- 
mum ordinate, the values given in Table 108 apply to 
observations on both sides of the mean. In using such a 
table, deviations from the mean are first expressed in units 
of the standard deviation. (The term normal deviate is 
applied to such a quantity, that is, to a deviation from the 
mean of a normal distribution expressed in units of the 
standard deviation of that distribution.) The proportion
of the total area lying between any two ordinates may then
be readily determined. For example: What proportion of
the cases in a normal distribution lies between the maximum
ordinate and an ordinate erected at a distance from the
mean equal to +σ? Reading down the x/σ column to 1.0,
we find the value .34134 opposite it. This, in ratio form,
is the proportion of cases falling within the limits indicated.
Expressing this ratio as a percentage, we have 34.134 per
cent as the answer to our question.

Fig. 85 shows the relation of this area (the shaded area A)
to the total area under the curve.

Fig. 85. An Illustration of the Measurement of Areas under the Normal
Curve (horizontal axis marked at −3σ, −2σ, −σ, 0, +σ, +2σ, +3σ, +4σ)

London; Tables of Applied Mathematics, J. W. Glover, Ann Arbor, Michigan,
George Wahr; Manual of Problems and Tables in Statistics, F. C. Mills and
D. H. Davenport, New York, Henry Holt and Co.

What proportion of the total number of cases in a normal
frequency distribution will fall between an ordinate erected
at a distance from the mean equal to −1.4σ and one
erected at −2σ? From the table we find that 41.924 per
cent of the total area will lie between y₀ and the ordinate
at −1.4σ; 47.725 per cent will lie between y₀ and the
ordinate at −2σ. The difference, 5.801 per cent, will fall
between the ordinates at −1.4σ and at −2σ. This may
be converted into actual frequencies by taking this
proportion of the total number of cases in the given distribution.
The shaded segment B in Fig. 85 represents the area thus
marked off.

For certain purposes we wish to know the proportion 
of the total number of cases deviating by a stated amount 
or more in either direction from the mean of a normal 
distribution. If we wish to know the proportion of all 
cases deviating from the mean by 1.96σ or more, we must
add to the area between +1.96σ and the upper limit of
the curve the area between −1.96σ and the lower limit of
the curve. Each of these areas equals .50000 − .47500,
or .025. The percentage of cases deviating from the mean
by +1.96σ or more is 2.5; the percentage deviating by
−1.96σ or more is 2.5. The percentage deviating above
or below the mean by 1.96σ or more is 5.0. Similarly,
it may be determined from the entries in Table 108 that
just one per cent of all the cases in a normal distribution
will deviate from the mean, positively or negatively, by
2.5758σ, or more. This "one per cent" area is represented
by the sum of the shaded portions at the two tails of Fig. 85.
The ordinates defining the inside limits of these segments are
erected at +2.5758σ and at −2.5758σ, while the outer
limits are at infinity.
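
The readings taken from Table 108 in the examples above may be reproduced without a printed table. The sketch below is an illustration only: it relies on the error function of the Python standard library, `math.erf`, in place of the tabulated areas, and the function names are assumptions of this example.

```python
from math import erf, sqrt

def area_from_mean(z):
    """Fraction of the normal area between the mean and an ordinate at z sigma."""
    return 0.5 * erf(abs(z) / sqrt(2))

def two_tailed(z):
    """Fraction of cases deviating from the mean by z sigma or more, either way."""
    return 1 - 2 * area_from_mean(z)

# Area A: between the maximum ordinate and +1 sigma.
print(round(area_from_mean(1.0), 5))

# Segment B: between -1.4 sigma and -2 sigma.
print(round(area_from_mean(2.0) - area_from_mean(1.4), 5))

# The "five per cent" and "one per cent" limits.
print(round(two_tailed(1.96), 4), round(two_tailed(2.5758), 4))
```

Because the curve is symmetrical, the function takes the absolute value of the normal deviate, just as a single column of the table serves for deviations on both sides of the mean.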

Special significance attaches to the two limits last men- 
tioned, because of the uses made of them in interpreting 
errors of sampling. This topic is developed at a later point. 
Here we may note that the figures defining proportions of 
the total area under the normal curve falling in given 
areas may also be interpreted as probabilities. The proba- 
bility that a given observation, made at random in a popu- 
lation distributed according to the normal law of error, 
will fall between the mean and a value one standard
deviation above the mean is .34134; the probability that a given
observation will deviate from the mean by 1.96σ or more
is .05; the probability that a given observation will deviate
from the mean by 2.5758σ or more is .01.

The method by which probabilities of occurrence may
be determined from a table of areas under the normal
curve, and by which the significance of a given normal
deviate may be established, should be clearly understood.
These methods enter in many ways into the work of a 
statistician. 

The uses of the normal curve of error, and of the table 
of areas based upon the integration of this curve, are too 
varied to be enumerated at length here. A simple example 
may serve to introduce the subject. 


An Economic Application 

The statistical division of the American Telephone and 
Telegraph Company has made a study of the annual 
message use of four-party line residence message rate
subscribers in Buffalo. The annual messages for each of 995
subscribers were tabulated and classified.¹ The results,
together with certain computations, appear in Table 109.


THE MOMENTS OF A FREQUENCY DISTRIBUTION 

Some terms and symbols that have not been employed
heretofore may be introduced at this point. We may
write, using ν (nu) to define certain quantities of interest
to us,

$$\nu_1 = \frac{\Sigma f(x')}{N} = \text{first moment of the distribution about the arbitrary origin.}$$

$$\nu_2 = \frac{\Sigma f(x')^2}{N} = \text{second moment of the distribution about the arbitrary origin.}$$

$$\nu_3 = \frac{\Sigma f(x')^3}{N} = \text{third moment of the distribution about the arbitrary origin.}$$

$$\nu_4 = \frac{\Sigma f(x')^4}{N} = \text{fourth moment of the distribution about the arbitrary origin.}$$

¹ Statistician, American Telephone and Telegraph Co.


Table 109

Annual Message Use of 995 Telephone Subscribers
(Illustrating the computation of the moments of a frequency distribution)

     (1)        (2)     (3)      (4)        (5)       (6)        (7)         (8)
  Interval     Mid-    Fre-   Deviation
 of message    point  quency  from arbi-
    use *                     trary origin
                              in class-
                              interval
                              units
                 m       f       x'         fx'     f(x')²     f(x')³      f(x')⁴

     0-   50     25      0     - 10           0         0          0           0
    50-  100     75      1     -  9      -    9        81     -   729       6,561
   100-  150    125      9     -  8      -   72       576     - 4,608      36,864
   150-  200    175     19     -  7      -  133       931     - 6,517      45,619
   200-  250    225     38     -  6      -  228     1,368     - 8,208      49,248
   250-  300    275     50     -  5      -  250     1,250     - 6,250      31,250
   300-  350    325     95     -  4      -  380     1,520     - 6,080      24,320
   350-  400    375     85     -  3      -  255       765     - 2,295       6,885
   400-  450    425    115     -  2      -  230       460     -   920       1,840
   450-  500    475    132     -  1      -  132       132     -   132         132
   500-  550    525    144        0           0         0          0           0
   550-  600    575    116        1         116       116        116         116
   600-  650    625     79        2         158       316        632       1,264
   650-  700    675     54        3         162       486      1,458       4,374
   700-  750    725     31        4         124       496      1,984       7,936
   750-  800    775     11        5          55       275      1,375       6,875
   800-  850    825      5        6          30       180      1,080       6,480
   850-  900    875      6        7          42       294      2,058      14,406
   900-  950    925      2        8          16       128      1,024       8,192
   950-1,000    975      1        9           9        81        729       6,561
 1,000-1,050  1,025      1       10          10       100      1,000      10,000
 1,050-1,100  1,075      1       11          11       121      1,331      14,641

   Total               995               -  956     9,676    - 22,952     283,564


“Moment” is a familiar mechanical term for the measure 
of a force with respect to its tendency to produce rotation. 
The strength of this tendency depends, obviously, upon the 
amount of the force and the distance of the point at which 
the force is exerted from the origin. The term is used in
statistics in a quite analogous sense, the class-frequencies being
looked upon as the forces in question. The size of each
class-frequency and the distance of each class midpoint from the
origin are the factors of prime importance in this respect.
The moments of a distribution about any origin may be
computed by multiplying the frequency of each class by
a given power of its distance, along the x-axis, from the
origin, summing the resulting products and dividing by
the number of cases. If the first moment is desired, the
first power of the x-distance is employed; if the fourth
moment, the fourth power of the x-distance, etc. The
subscripts indicate the moments represented by the various
symbols.

* As here classified an item having a value of 50 was put in the class having
50 as an upper limit. Items falling on other class limits were similarly disposed
of.
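
The rule just stated (multiply each class-frequency by a power of its distance from the origin, sum, and divide by the number of cases) may be sketched as follows for the data of Table 109. The function name `moment_about_origin` is an assumption of this illustration; the frequencies and deviations are those of columns (3) and (4).

```python
# Column (3) of Table 109: class frequencies, lowest class first.
frequencies = [0, 1, 9, 19, 38, 50, 95, 85, 115, 132, 144,
               116, 79, 54, 31, 11, 5, 6, 2, 1, 1, 1]
# Column (4): deviations x' from the arbitrary origin (midpoint 525),
# in class-interval units, running from -10 to +11.
deviations = list(range(-10, 12))

def moment_about_origin(order, freqs, devs):
    """Sum of f * (x')^order, divided by the number of cases N."""
    return sum(f * x**order for f, x in zip(freqs, devs)) / sum(freqs)

nu = {k: moment_about_origin(k, frequencies, deviations) for k in (1, 2, 3, 4)}
for k, value in nu.items():
    print(f"nu_{k} = {value:.6f}")
```

The four sums formed inside the function correspond to the totals of columns (5) to (8) of the table.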

The most significant moments, for statistical purposes, 
are those which relate to the arithmetic mean as origin. 
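Representing these moments by π (pi)¹ we have the
following relationships:

$$\begin{aligned}
\text{First moment about the mean} &= \pi_1 = 0. \\
\text{Second moment about the mean} &= \pi_2 = \nu_2 - \nu_1^2. \\
\text{Third moment about the mean} &= \pi_3 = \nu_3 - 3\nu_1\nu_2 + 2\nu_1^3. \\
\text{Fourth moment about the mean} &= \pi_4 = \nu_4 - 4\nu_1\nu_3 + 6\nu_1^2\nu_2 - 3\nu_1^4.
\end{aligned}$$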

The computation of these moments from the data, as 
classified, involves the assumption that the items in each 
class can be treated as though they were concentrated at 
the midpoint of that class. It has been established that,
under certain conditions, calculations made on this assump- 
tion are subject to a constant error. In particular, it has 
been shown that the values of the second and fourth 
moments are not the same, when computed from grouped 
data, as when computed from ungrouped data. 

W. F. Sheppard² has worked out certain corrections for
this bias. His corrections may be applied when two
conditions prevail:

(1) When the distribution relates to a continuous variable.

(2) When the frequency curve is characterized by "high
contact," i.e., when the frequency curve tapers off gradually
in both directions.

The symbol μ (mu) is employed to represent a corrected
moment about the mean. The application of Sheppard's
corrections gives us the following final formulation:

$$\begin{aligned}
\mu_1 &= 0 \\
\mu_2 &= \pi_2 - \tfrac{1}{12} \\
\mu_3 &= \pi_3 \\
\mu_4 &= \pi_4 - \tfrac{1}{2}\pi_2 + \tfrac{7}{240}
\end{aligned}$$

(In applying the corrections 1/12 and 7/240 the
corresponding decimal values, .083333 and .029167, will generally
be employed.) It is assumed in making these corrections
that a class-interval unit has been employed in measuring
deviations from the mean.

¹ In the equation to the normal curve π represents the familiar constant,
3.14159. As a symbol for a moment about the mean it relates, of course, to
no such constant value.

² Cf. Proceedings of the London Mathematical Society, Vol. XXIX, 353-380.

It may be noted in passing that the standard deviation is 
the square root of the second moment about the mean. For
the uncorrected value,

$$\sigma = \sqrt{\pi_2}.$$

If Sheppard's corrections¹ are to be applied

$$\sigma = \sqrt{\mu_2}.$$

The calculation of the moments of the frequency distribution
of telephone subscribers is shown below. Sheppard's
corrections are applied, since the curve is marked by
reasonably high contact. It is a discontinuous distribution, 
but the unit (1) is so small in comparison with the range 
that it may be treated as continuous. 

¹ It should be noted that these corrections, when appropriate, are applicable
to the standard deviations entering into the calculation of the coefficient of
correlation.

$$\begin{aligned}
\nu_1 &= \frac{-956}{995} = -.960804 \\
\nu_2 &= \frac{9{,}676}{995} = 9.724623 \\
\nu_3 &= \frac{-22{,}952}{995} = -23.067337 \\
\nu_4 &= \frac{283{,}564}{995} = 284.988945
\end{aligned}$$

$$\begin{aligned}
\pi_1 &= 0 \\
\pi_2 &= \nu_2 - \nu_1^2 = 9.724623 - .923144 = 8.801479 \\
\pi_3 &= \nu_3 - 3\nu_1\nu_2 + 2\nu_1^3 = -23.067337 + 28.030370 - 1.773922 = 3.189111 \\
\pi_4 &= \nu_4 - 4\nu_1\nu_3 + 6\nu_1^2\nu_2 - 3\nu_1^4 \\
      &= 284.988945 - 88.652760 + 53.863384 - 2.556586 = 247.642983
\end{aligned}$$

$$\begin{aligned}
\mu_1 &= 0 \\
\mu_2 &= \pi_2 - \tfrac{1}{12} = 8.801479 - .083333 = 8.718146 \\
\mu_3 &= \pi_3 = 3.189111 \\
\mu_4 &= \pi_4 - \tfrac{1}{2}\pi_2 + \tfrac{7}{240} = 247.642983 - 4.400739 + .029167 = 243.271411
\end{aligned}$$
CRITERIA OF CURVE TYPE

Having these values, we may return to a consideration
of the main problem, the utilization of our knowledge of
the normal curve. There are certain criteria, represented
by the letters β (beta) and κ (kappa), which enable us
to determine readily whether a given distribution may be
described by a curve of the normal type. These may be
derived from the corrected moments of the given distribution.

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{10.170429}{662.632015} = .01534853$$

$$\beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{243.271411}{76.006070} = 3.200683$$

$$\kappa_2 = \frac{\beta_1(\beta_2 + 3)^2}{4(4\beta_2 - 3\beta_1)(2\beta_2 - 3\beta_1 - 6)} = \frac{.01534853 \times 38.448470}{4(12.756686)(.355320)} = \frac{.5901275}{18.130823}$$

$$\kappa_2 = .032548$$

For the normal curve these criteria have the following
values:

$$\beta_1 = 0, \qquad \beta_2 = 3, \qquad \kappa_2 = 0.$$

We may conclude, tentatively, that the normal curve
may be used to describe the given distribution.¹
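
The three criteria may be computed from the corrected moments in a few lines. The sketch below follows the expressions of this section; the function name `pearson_criteria` is an assumption of the illustration, and the inputs are the corrected moments found for the telephone data.

```python
def pearson_criteria(mu2, mu3, mu4):
    """beta_1, beta_2 and kappa_2 from the corrected moments about the mean."""
    beta1 = mu3**2 / mu2**3
    beta2 = mu4 / mu2**2
    kappa2 = (beta1 * (beta2 + 3) ** 2) / (
        4 * (4 * beta2 - 3 * beta1) * (2 * beta2 - 3 * beta1 - 6)
    )
    return beta1, beta2, kappa2

beta1, beta2, kappa2 = pearson_criteria(8.718146, 3.189111, 243.271411)
print(f"beta_1 = {beta1:.8f}")
print(f"beta_2 = {beta2:.6f}")
print(f"kappa_2 = {kappa2:.6f}")
```

The expression for the third criterion divides by the quantity 2β₂ − 3β₁ − 6, which vanishes when β₁ = 0 and β₂ = 3; the sketch is therefore suited to observed distributions such as the present one, not to the exact normal values themselves.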

Fitting a Normal Curve; Use of a Table of Areas

The process of fitting a normal curve to a set of observa- 
tions involves the computation of theoretical frequencies 
corresponding to the observed frequencies. This may be 
done from a table of areas under the normal curve (see 
Appendix Table I). Using such a table, in the manner indi- 
cated in the preceding section, the areas between the maxi- 
mum ordinate and ordinates erected at the various class 
limits may be determined. By the simple process of subtrac- 
tion the area within each class, and hence the theoretical 
frequencies, may then be computed. The procedure is
illustrated in Table 110, relating to the distribution
of telephone subscribers.

The theoretical distributions derived from this fitting 
process may be compared with the observed frequencies, 
as given in Table 109. Or the comparison of the actual 
distribution and the fitted curve may be made graphically, 
as in Fig. 86. It is apparent by inspection that the normal 
curve gives a fairly good fit to the data, although there 
are several classes in which the differences are marked. A 
natural question arises as to the reason for the failure of 
the normal curve to fit at all points. There are two possible
answers to such a question. The failure to fit may arise from
the sample: the distribution from which it is drawn may
accord perfectly with the normal law of error, but the
particular sample selected may be marked by certain
irregularities which would be ironed out if a very large
number of cases were included. On the other hand, the
differences may be due to the fundamental failure of such a
distribution to accord with the normal law of error. Such
a law may not describe the distribution of telephone calls,
in which case the normal curve should not be employed.

¹ Account is later taken of the bearing of errors of sampling on this
conclusion. See Chap. XVIII.

Table 110

Illustrating the Computation of Theoretical Frequencies from a Table
of Areas

   (1)       (2)           (3)               (4)
  Class   Deviation    Proportion of    Number of cases      Theoretical frequencies
  limit   from mean    area between     between y₀ and
             x/σ       y₀ and ordinate  ordinate at x/σ       Class          Frequency
                       at x/σ

      0    - 3.23        .4993810           496.88            0-   50          1.92*
     50    - 2.89        .4980738           495.58           50-  100          3.44
    100    - 2.55        .4946139           492.14          100-  150          7.78
    150    - 2.22        .4867906           484.36          150-  200         16.76
    200    - 1.88        .4699460           467.60          200-  250         31.57
    250    - 1.54        .4382198           436.03          250-  300         53.02
    300    - 1.20        .3849303           383.01          300-  350         79.43
    350    -  .86        .3051055           303.58          350-  400        106.10
    400    -  .52        .1984682           197.48          400-  450        126.41
    450    -  .18        .0714237            71.07          450-  500        134.31
    500    +  .16        .0635595            63.24          500-  550        125.50
    550    +             .1896931           188.74          550-  600        106.51
    600    +  .83        .2967306           295.25          600-  650         81.85
    650    + 1.17        .3789995           377.10          650-  700         55.21
    700    + 1.51        .4344783           432.31          700-  750         33.19
    750    + 1.85        .4678432           465.50          750-  800         17.81
    800    + 2.19        .4857379           483.31          800-  850          8.52
    850    + 2.53        .4942969           491.83          850-  900          3.63
    900    + 2.87        .4979470           495.46          900-  950          1.36
    950    + 3.20        .4993129           496.82          950-1,000           .48
  1,000    + 3.54        .4997999           497.30        1,000-1,050           .15
  1,050    + 3.88        .4999478           497.45        greater than 1,050    .05
  1,100    + 4.22        .4999878           497.49

                                                             Total            995.00

* The frequency below 0 has here been added to the theoretical frequency of
this class.
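
The fitting process of Table 110 may be sketched as follows. This is an illustration under stated assumptions, not the procedure of the table itself: the areas are obtained from `math.erf` of the Python standard library instead of a printed table, normal deviates are carried at full precision where the table rounds them to two decimals, and the mean (476.96) and standard deviation (147.65) in original units are the values quoted in the note at the end of this section. The function names are assumptions of the example.

```python
from math import erf, sqrt

def cumulative(z):
    """Fraction of the normal area lying below a normal deviate z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def theoretical_frequencies(limits, mean, sigma, total):
    """Expected cases per class; the two open tails are added to the end classes."""
    below = [cumulative((limit - mean) / sigma) for limit in limits]
    below[0], below[-1] = 0.0, 1.0      # fold the tails into the first and last class
    return [total * (hi - lo) for lo, hi in zip(below, below[1:])]

limits = list(range(0, 1101, 50))       # class limits 0, 50, ..., 1,100
fitted = theoretical_frequencies(limits, mean=476.96, sigma=147.65, total=995)

for lower, freq in zip(limits, fitted):
    print(f"{lower:5d}-{lower + 50:<5d} {freq:7.2f}")
```

The results differ slightly from the last column of Table 110, chiefly because of the rounding of the normal deviates in the table; the class of greatest theoretical frequency is 450 to 500 in both.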

At this stage we may note, without discussion, that the
differences between theoretical and observed frequencies in
the present example are small enough to be attributed to
chance fluctuations of sampling. The reasoning that
supports this conclusion is presented in a later section (Chapter
XVIII). The evidence is clear, however, that the
discrepancies between the observed frequencies and those in the
corresponding normal distribution are not excessively large.
The observed facts are not inconsistent with the hypothesis
that residential telephone subscribers, classified according
to frequency of telephone use, are distributed in accordance
with the normal law of error.

Fig. 86. Illustrating the Fitting of a Normal Curve to Frequency
Distribution of Telephone Subscribers, Classified according to Message
Use

This conclusion gives generality to the results of our
study. We have a great deal of information concerning
the attributes of distributions following the normal law
of error, and once the identification of an actual
distribution with this standard type has been effected we may
draw upon this store of knowledge. In using the original
frequency table we are limited to the classes there
established. We may now go beyond this and determine how
many cases may be expected within stated limits. We may
compute the probability of a case falling between any two
points on the x-scale, or above or below any given value.
The observed results, standing alone, are restricted in their
significance to the particular observations recorded, but
the theoretical frequencies have no such limitations. They
apply generally, to the entire population from which the
sample was drawn. In so far as we are assured of the
representativeness of the sample, we secure a significance that
would be afforded by no amount of study of the particular
distribution as a thing apart. This fact, that a knowledge
of the theoretical frequencies permits us to go beyond the
particular observations, is perhaps the most important of
the advantages derived from the identification of an actual
distribution with an ideal type, such as the normal
distribution.¹


NOTE ON THE DESCRIPTION OF THE FREQUENCY DISTRIBUTION

A full treatment of this subject is beyond the scope of the present
book, but it seems advisable to indicate briefly the nature of these
additional measures.

¹ An account of other fundamental types will be found in the work of
Arne Fisher referred to at the end of this chapter.

The value of $\beta_2$ serves as a measure of the degree of "flat-toppedness"
found in a given curve. If $\beta_2 = 3$, as in the normal type, the
curve is said to be mesokurtic. If $\beta_2 < 3$ the curve is platykurtic, or
flatter than the normal type. If $\beta_2 > 3$, as in the example given
above, the curve is leptokurtic, or more peaked than the normal.
A measure of skewness which is more accurate than those given
early in the book may also be computed from these criteria.
Karl Pearson has shown that the quantity

$$\chi = \frac{\sqrt{\beta_1}\,(\beta_2 + 3)}{2(5\beta_2 - 6\beta_1 - 9)}$$

serves as a measure of the degree of asymmetry of a given curve.
Inserting the values of $\beta_1$ and $\beta_2$ given above we have, in the case
of the distribution based on message use,

$$\chi = .05558.$$

($\chi$ is positive if the mean is greater than the median, negative if
the mean is less than the median. In the present case the value of
the mean is 476.96, that of the median is 482.39, hence the
skewness is negative.)

Finally, the distance, d, between the mean and the mode may
be determined from the relation

$$d = \chi \times \sigma.$$

In the distribution described above (relating to telephone use) $\sigma$,
in original units, equals 147.65. Hence

$$d = -.05558 \times 147.65 = -8.21.$$

Since

$$Mo = M - d$$

we have

$$Mo = 476.96 + 8.21 = 485.17.$$

This gives a truer approximation to the modal value than any of
the methods discussed in Chapter IV.
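
The arithmetic of this note may be checked with a short sketch. The function names are hypothetical, introduced only for illustration. The formula returns the magnitude of the skewness; the sign is attached afterward from the positions of mean and median, as the text describes.

```python
import math


def pearson_skewness_magnitude(beta1, beta2):
    """Magnitude of Pearson's skewness measure computed from beta_1 and beta_2."""
    return math.sqrt(beta1) * (beta2 + 3.0) / (2.0 * (5.0 * beta2 - 6.0 * beta1 - 9.0))


def mode_from_mean(mean, chi, sigma):
    """Mode = mean - d, where d = chi * sigma is the mean-to-mode distance."""
    d = chi * sigma
    return mean - d


# Figures quoted in the text for the telephone-use distribution.
M = 476.96
SIGMA = 147.65
CHI = -0.05558   # negative because the mean (476.96) is below the median (482.39)

d = CHI * SIGMA
mode = mode_from_mean(M, CHI, SIGMA)
print(f"d  = {d:.2f}")
print(f"Mo = {mode:.2f}")
```

For a normal curve ($\beta_1 = 0$, $\beta_2 = 3$) the skewness function returns zero, so that mean and mode coincide.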

The methods exemplified in Table 109 and the accompanying
text provide, therefore, a straightforward procedure for the
measurement of the essential attributes of a frequency distribution.
The mean and mode as measurements of central tendency, the
standard deviation as a measure of dispersion, $\chi$ as a measure
of skewness, and $\beta_2 - 3$ as a measure of the degree of [...]
observations near the point of maximum frequency [...]
directly from the first four moments of [...] These
methods are available, of course, whether or [...]
carried to the point of determining and fitting [...]
of an appropriate ideal type. They are to be [...]
in any systematic study of frequency distributions.

CHAPTER XIV 


STATISTICAL INDUCTION AND THE PROBLEM 
OF SAMPLING 

The preceding pages have been devoted to an account of 
tools employed in statistical analysis. Examples illustrating 
the application of these tools to specific problems have 
been presented, but the emphasis throughout has been on 
technique. It is appropriate at this point that we stand off
a distance, enlarging our perspective, and consider certain
general problems relating to the application of these tools.
What is their proper place in economic and business research?
What are the assumptions involved in using them and
what are their limitations? What are the end products
of statistical analysis? How valid are the conclusions
reached? What restrictions attach to such conclusions?
We must give thought to such questions as these, if statistical
methods are to be intelligently applied. 

Statistical Description and Statistical Induction 

In approaching this subject we must first make clear 
the distinction between statistical description and statistical
induction. By employing the methods of statistics it is
possible, as we have seen, to describe succinctly a mass
of quantitative data. Hundreds or thousands of individual
cases may be classified, and a frequency distribution formed.
The essence of this distribution may be boiled down to
perhaps four measures: of central tendency, variation,
skewness, and kurtosis. A tremendous gain has been realized
in thus replacing the multiplicity of individual cases by a
limited number of measures that define the characteristics
of the group as a whole. The possession of such tools makes
it possible for our limited powers of perception to grasp
the significance of facts in the mass. Again, the methods 
of statistics enable us to describe relations between variable 
quantities. By securing the equation to an appropriate 
curve fitted to the data by mathematical methods, we 
may determine how much, on the average, one quantity 
changes in value as a related factor varies. This may be 
supplemented by a measure of the scatter or dispersion 
about the fitted curve, and by a measure, in abstract 
terms, of the degree of correlation between the dependent 
and the independent variables. 

In so far as the results are confined to the cases actually 
studied, these various statistical measurements are merely 
devices for describing certain features of a distribution, or 
certain relationships. Within these limits the measures 
may be used with perfect confidence, as accurate descrip- 
tions of the given characteristics. But when we seek to 
extend these results, to generalize the conclusions, to apply
them to cases not included in the original study, a quite 
new set of problems is faced. 

The logical process by which one arrives at generalizations 
from a study of particular cases is termed induction, as 
opposed to deduction, which involves the drawing of specialized
conclusions from general propositions. By statistical
induction or statistical inference is meant the generalization
of statistical results, the application to a population of
measurements derived from a sample. We are employing
this procedure constantly in practical statistical work, 
though not always with a full realization of the assumptions 
inherent in that process and of the limitations attaching 
to it. 

The Nature of Statistical Induction 

The problem at issue in considering the validity of 
statistical induction may be put in the following form:
A statistical measurement (an average, a frequency ratio,
a coefficient of correlation) has been derived from the
study of sample data drawn from a given population.
(The term “population” refers to a complete universe of
things or phenomena having stated characteristics in common.)
May we assume that, if additional samples were
taken from the same population, the corresponding measurements
would have the same values? If not, may we determine
the approximate limits to the fluctuations to be
expected in these measures, as derived from successive 
samples? Here, obviously, is a problem of supreme impor- 
tance. Karl Pearson has called it “the fundamental problem 
of practical statistics.” If we cannot be assured of a certain 
degree of stability in the results secured from successive 
samples it would be quite invalid to generalize from the 
examination of a limited number of cases. No weight
would attach to any study except one covering the entire
universe of things or phenomena composing the given
population. Yet such all-inclusive studies of economic
phenomena are practically impossible. Index numbers of
prices, of wages, of living costs, equations describing the
relation between the production and prices of given commodities,
coefficients of correlation between temperature
and crop yield, all must of necessity be based on the
study of samples. The problem of statistical inference,
in the words of Oskar Anderson, is that of so utilizing the
samples as to arrive at the best possible approximation
to the characteristics of the universe.



the cases to be covered by the conclusions are included in 
the observations, the conclusion ceases to be an induction 
and becomes a descriptive statement. Accordingly, although 
induction is a highly fruitful [...] knowledge, it is
always hazardous. A leap [...] from case to case [...]

The justification for this leap in the dark [...] that there is
a limitation to the amount of independent [...] some uniformity
in all natural processes. When we are [...] found in the
stability of large numbers, as exemplified by such phenomena
as birth rates [...] Nature, in other words, is not marked by
[...] order and stability appear in all natural processes, and
these principles are [...] deal with masses of quantitative
data. Therefore, when we generalize such a measure as
an index number of wholesale prices, we do so on some such
assumption as this: It is reasonable to suppose that in
the larger population to which this result is to be applied
[...] or relation we have measured. As a result of this
uniformity we should expect statistical measurements derived
from [...]

[...] assumption, in saying “It is reasonable to suppose,” we
are introducing an hypothesis which is incapable of complete
[...] by purely statistical methods. There is [...] The
statistical conclusion [...] on its own feet. It must be
endorsed by reason and judgment if it is to
carry conviction. If a high positive coefficient of correlation
were secured from the study of a sample relating to banana 
importations and sales of new life insurance, this would 
not furnish convincing evidence of a causal relation, or a 
relation of contingency, between these two variables. There 
is no reasonable basis for assuming that, in the larger 
universe of phenomena from which the sample was drawn,
there would be uniformity with respect to this relationship.

Statistical inference differs from the general process of 
induction in that a quantitative result is generalized. We 
seek to apply to a larger group — the population — the 
value of mean, standard deviation, or coefficient of correlation
that has been computed from a sample. The measurement
secured from the sample is an estimate of the corresponding
measurement relating to the population. The
direct task faced in such generalization is that of determining
the limits within which these estimates would probably
fluctuate, if based upon a number of different samples
drawn from the same population. A number defining these
limits will serve as a measure of the reliability of the given
results, when generalized to apply to the population.

We should make clear at this point the sense in which 
the term population is used. When we speak of a popula- 
tion we are referring to an aggregate, whether of persons, 
things, or measurements, having certain common charac- 
teristics, or generated by a given system of causes. The 
term may refer to a hypothetical population from which 
a given sample may or may not have been drawn, or to a 
parent population of which a given sample is assumed to 
be representative. It may be a population of prices, or a 
population of cephalic indices; the term is not restricted
to a population of persons. R. A. Fisher speaks of a “population
of possibilities,” referring to the possible results of
an experiment many times repeated. Of high importance
to statistics are populations of statistical measurements:
means, coefficients of variation, standard deviations, etc.
It is proper to note that the populations to which most 
statistical results apply are infinite in size. Statistical 
generalizations relate to hypothetical universes containing 
infinite numbers of units. We assume a sample to be 
drawn, not from the finite population that might be covered 
by actual enumeration, but from the infinite population, 
or universe, that would be generated if the forces or system 
of causes that brought this sample into being were to 
operate without limit. (Statisticians have given some atten- 
tion to special techniques, appropriate for dealing with a 
finite universe, but problems with which we do not here 
deal are faced in such applications.) 

The principle of the uniformity of nature is assumed, of 
course, to apply to the universes from which our samples 
are drawn, if these samples are to be made bases of inductive 
generalizations. We must assume that these universes are 
stable, and that all their attributes are stable. An attribute 
of such a stable universe may not be exactly determined 
from the attribute of a single sample, but measurements 
defining the attributes of numerous samples drawn from 
the same universe will be distributed about the true value 
(i.e., that of the universe) in a systematic fashion. Each 
sample value is, of course, an estimate of the true value
of the corresponding attribute of the population at large.¹
The precise determination of the characteristics of this 
distribution of estimates is essential to the determination 
of their reliability. 

¹ By convention, not yet generally adopted, but useful, the attribute of
the population which is being estimated is termed a parameter, while the
estimate of it is termed a statistic. Our certain knowledge is limited to statistics.
We use this knowledge to the best of our ability to provide us with
approximations to the true parameters, which we can never know.

Having knowledge of this distribution we may determine
the limits within which estimates derived from different
samples of the same population may be expected
to fluctuate. A measure of these limits will serve as a
measure of the reliability of the given results, when
generalized. Such a measure might be secured by the laborious
process of studying a great many different samples,
just as the dice were thrown 4,096 times in a preceding 
example. Thus we might desire to test the reliability of an 
average of weekly earnings of a certain class of workers. 
A first average might be secured from a sample composed 
of 250 individual records. This result might be tested by
computing 499 additional averages, each based on 250 
individual records. These 500 averages would not be 
identical in value, but if they were tabulated a frequency 
distribution closely approximating the normal type would 
be secured. From this distribution we might compute the 
mean of all the averages and the standard deviation of 
these averages. This standard deviation would serve as a
measure of the variation found in the averages of weekly 
earnings, as computed from successive samples. 
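
The laborious experiment just described is easily imitated by machine. The sketch below is an illustration under stated assumptions, not a record of any actual study: the population of weekly earnings is a hypothetical skewed (log-normal) one, and its parameters and the random seed are arbitrary choices.

```python
import random
import statistics


def draw_earnings(rng):
    """One weekly-earnings record from a hypothetical skewed population."""
    return rng.lognormvariate(3.0, 0.5)


def sample_means(n_samples=500, sample_size=250, seed=0):
    """Averages of successive random samples drawn from the same population."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_samples):
        records = [draw_earnings(rng) for _ in range(sample_size)]
        means.append(statistics.fmean(records))
    return means


means = sample_means()
grand_mean = statistics.fmean(means)       # mean of all the averages
sd_of_means = statistics.stdev(means)      # standard deviation of the averages

# In a normal distribution roughly 68 per cent of cases lie within one
# standard deviation of the mean; the averages should come close to this.
share_within_one_sd = sum(
    abs(m - grand_mean) <= sd_of_means for m in means
) / len(means)

print(f"mean of the 500 averages: {grand_mean:.3f}")
print(f"standard deviation:       {sd_of_means:.3f}")
print(f"share within one s.d.:    {share_within_one_sd:.3f}")
```

Although the individual records are markedly skewed, the 500 averages cluster closely and nearly symmetrically about their mean.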

We have noted at an earlier point, that a Gaussian or 
normal distribution is generated when three general condi- 
tions prevail. These are: 

a multiplicity of forces affecting each observation 

independence of the various forces affecting each observa- 
tion 

equality of the forces tending to generate values above 
and below the mean value. 

The process of random sampling which would, presumably, 
be employed in securing the successive samples referred to 
in the preceding paragraph should satisfy the conditions 
giving rise to the normal distribution. There should be no 
special or unbalanced influences affecting particular samples. 
The difference between successive samples should be such 
as arise from a combination of forces, intermingled and 
not open to separate definition; that is, from “the mass of 
floating causes known as chance.” If these conditions be
fulfilled, and if the field of observation (i.e., the universe being
sampled) be homogeneous, the distribution of means computed
from the successive samples would be normal.

This is a fact of high importance to statistical inference. 
In the realm of original observations, relating to persons, 
things, or events, normal distributions are the exception, 
rather than the rule. But the measurements which the 
statistician derives from successive samples, and which he 
employs in the inductive reasoning by which he generalizes 
his results, are far more frequently distributed in accordance 
with the Gaussian law. Much of the power of statistical 
instruments derives from this fact. 

The statistical investigator is rarely in a position to 
build up a frequency distribution of constants derived from 
numerous samples. It is generally impossible to take 400 
or 500 successive samples, in testing the reliability of a
given measurement. As a practicable alternative a process 
of mathematical deduction is employed, in determining the 
characteristics of distributions of statistical measurements 
derived from random samples, drawn under stated conditions 
from given populations. An example of such mathematical 
deduction is provided by the derivation of the mean and 
standard deviation of a distribution generated under the 
following conditions: 

p, the probability of a given event occurring, is known

q, the probability of the event not occurring, is known

n, the number of independent events in a single trial, is known.

Under these conditions, as was noted in the preceding 
chapter, $M = np$, and $\sigma = \sqrt{npq}$.¹ By a somewhat similar
chain of reasoning, we may determine the characteristics
of a distribution composed of arithmetic means of a number
of samples of constant size drawn from a given population.
The standard deviation of such a distribution is given by

$$\sigma_M = \frac{\sigma}{\sqrt{N}}$$

where $\sigma_M$ is the required measure, $\sigma$ is the standard deviation
of the population from which the samples have been drawn,
and N is the number of observations in each of the samples.

¹ For proofs, see Appendix B.
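
The two deductive results just stated may be verified numerically. In the sketch below the binomial mean and standard deviation are obtained by direct enumeration of the probabilities and compared with np and the square root of npq. The case chosen (n = 12, p = 1/2) and the function names are assumptions made for illustration.

```python
import math


def binomial_moments(n, p):
    """Mean and standard deviation of the number of successes in n events,
    found by enumerating every term of the binomial expansion."""
    q = 1.0 - p
    probs = [math.comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]
    mean = sum(k * pk for k, pk in enumerate(probs))
    variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(probs))
    return mean, math.sqrt(variance)


def standard_error_of_mean(sigma, n_obs):
    """Standard deviation of means of samples of n_obs observations."""
    return sigma / math.sqrt(n_obs)


n, p = 12, 0.5                      # hypothetical trial
mean, sd = binomial_moments(n, p)
print(mean, n * p)                                  # both equal np
print(sd, math.sqrt(n * p * (1.0 - p)))             # both equal sqrt(npq)
print(standard_error_of_mean(sigma=10.0, n_obs=100))  # hypothetical sigma
```

The last line shows the effect of sample size: with a population standard deviation of 10 (an assumed figure), means of samples of 100 observations have a standard deviation one-tenth as great.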

The determination by deduction of the characteristics
of distributions of statistical constants derived from samples
is fundamental to the whole process of statistical inference.
It is not, of course, a task that needs to be done afresh in
each statistical investigation. When the law of distribution
of a given class of statistical measurements has been determined,
statisticians may utilize the results in their various
research fields, with due regard to all the conditions under
which the given law holds. This basic task has been performed
for most of the statistical measurements currently
employed. Earlier approximations have been refined in
recent years for many classes of statistical measurements.
The statistician today may draw upon a considerable body
of tested and verified materials in determining the reliability
of various kinds of statistical estimates. These
materials exist in the form of shorthand expressions for 
the standard errors of different statistical constants, and 
in prepared tables for use when the distributions deviate 
materially from the type defined by the normal law of error. 

Practical Problems of Sampling 

The preceding discussion has dealt with one aspect of 
statistical induction. The argument has proceeded on the 
assumption that inferences concerning the attributes of 
a population would be based upon a sample thoroughly 
representative of the universe from which it was drawn. 
The securing of such a sample is a first condition of valid 
statistical induction. Practical problems of the first impor- 
tance are faced in the actual field work of sampling. The 
procedures employed in such field work lie, in the main, 
beyond the scope of the present book, but it is desirable
that the general nature of sampling techniques be indicated. 
References given at the end of the chapter deal in greater 
detail with these procedures. 

The task of securing an adequate sample calls, on the 
negative side, for an avoidance of bias in the individual 
observations and of preventable errors in schedules and 
tabulations. The term bias is applied to observational 
errors that are cumulative and non-compensating. Personal 
prejudices on the part of reporters, mental attitudes of 
which the subjects may be unconscious, or the mere physical 
conditions of observation may lead to persistent errors 
that distort samples. Errors in recording and tabulation 
are easier to detect. Training of enumerators and careful 
editing of schedules and tables will keep such errors to a 
minimum. 

On the positive side sampling technique is directed toward 
the securing of a sample that is truly representative of the 
universe of inquiry. This is a major task, calling for a 
high degree of care and judgment in planning field opera- 
tions conforming to the ultimate objectives of the study. 
A. L. Bowley has classified, under the four heads distin- 
guished below, methods suitable for use in securing a 
representative sample. 

The method of random selection is employed when the 
entire population to be sampled is treated as a whole, and 
members of the sample are so chosen as to be random 
members of that population. In this selection the indi- 
vidual choices must be independent of one another, and 
the chance of any member of the entire population being 
included in the sample must be the same as that of every 
other member. As regards the conditions of selection there 
should be present no element of preference or bias that 
would tend toward the inclusion or exclusion of certain 
members of the larger group. The general requirement 
here laid down should be interpreted, as J. M. Keynes 
has pointed out, to mean that with respect to the purpose 
of the particular investigation the members of the sample 
should be random members of the population at large. 
Intelligent planning is needed in securing a purely random 
sample. The obvious procedure of picking the most readily 
available cases would by no means meet the condition of 
random selection. Certain important elements in the uni- 
verse of facts to which the conclusions are to be applied 
may be excluded through the play of an unconscious bias 
unless careful attention is given to the selection of cases. 

The population from which a given sample is to be selected 
is often not homogeneous, with reference to the purpose of 
a particular investigation. Slum districts and wealthy districts 
may both have to be covered, in a study of social or eco- 
nomic conditions. Agricultural districts differing mate- 
rially in fertility may be included in a farm survey. If, 
by a process of stratification, the universe of inquiry 
may be broken into sub-groups individually more homo- 
geneous than the total population, the reliability of sampling 
results may be substantially increased. Within each sub-group
random selection may be employed. This method
is termed stratified random selection. The size of each group 
in the sample should be proportionate to the relative 
importance in the total population of the stratum repre- 
sented by that group. Where homogeneous sub-groups are 
secured by the process of stratification, and where the 
differences between the sub-groups are pronounced, this 
method is distinctly superior to that of random selection 
among the undifferentiated members of the population at 
large. 
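
Stratified random selection with proportionate representation may be sketched as follows. The strata, their sizes, and the function name are hypothetical, chosen only to illustrate the rule that each group in the sample should be proportionate to its stratum's share of the total population.

```python
import random


def stratified_sample(strata, total_size, seed=0):
    """Draw a random sample within each stratum, sized in proportion to the
    stratum's share of the whole population.

    `strata` maps a stratum name to a list of its members. Because each
    allocation is rounded, the sizes may not always sum exactly to total_size.
    """
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        k = round(total_size * len(members) / population_size)
        sample[name] = rng.sample(members, k)   # random selection within stratum
    return sample


# Hypothetical universe of 1,000 households in three districts.
strata = {
    "district_a": [f"a{i}" for i in range(600)],
    "district_b": [f"b{i}" for i in range(300)],
    "district_c": [f"c{i}" for i in range(100)],
}
chosen = stratified_sample(strata, total_size=50)
for name, members in chosen.items():
    print(name, len(members))
```

Within each stratum every member has the same chance of inclusion, so that the condition of random selection is preserved inside the sub-groups while the composition of the sample as a whole is fixed in advance.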

In using the third method, that of purposive selection, the 
statistician seeks to secure a sample having the same 
characteristics as the universe of inquiry in respect of one 
or more “control” factors. If these controls are highly 
correlated with the quantities that are the objects of investigation,
this method of selection gives obvious support to
generalizations based on the study of the sample. As in
stratified selection, sub-groups are employed. These sub-groups
are chosen not at random, but in such a way as to
possess, in the aggregate, the same attributes (e.g., means, 
standard deviations) as the population at large, in respect of 
the control factors. Deliberate manipulation, often through 
a process of trial and error, is necessary to effect this agree- 
ment between the sample and the totality. 

When this method is employed the statistician must, 
of course, have information concerning the “controls” for 
the total population. The application of the method is 
restricted to fields in which such knowledge is available. 

Census type inquiries on population, agriculture, and manu- 
factures provide such basic knowledge. Promising work 
has been done in purposive selection in dealing with agricul- 
tural data. 

The fourth method, that of stratified purposive selection,
represents a combination of the use of stratification to
secure homogeneous sub-groups and of deliberate selection
through the use of controls. Where data are open to such
stratification, and where necessary controls are available,
the combined procedures may profitably be employed.

When a representative sample has been secured, when
errors and bias have been avoided, we may still expect the
attributes of the sample to differ from those of the total
population. The effects of fluctuations of sampling will
still be present, so long as the coverage of the sample falls
short of the universe of inquiry. We may only estimate the
attributes of the population; we still face the uncertainties
that inhere in induction. It is possible, however, to define
with considerable precision the probabilities involved in
statistical induction when the differences between the
attributes of the sample and those of the total population
are due to fluctuations of “simple sampling,” that is, to
the scrambled mass of causes that constitutes chance.

Under these conditions it is possible to assign in advance 
limits within which we may expect statistical measures 
derived from different samples of the same population to 
fluctuate. This means that we may apply to the population 
at large statistical measures secured from the study of a 
sample, not with confidence in their perfect stability, but 
with fairly definite knowledge of the margin of error involved 
in thus extending our results. Where the necessary condi- 
tions are fulfilled statistical induction is a valid procedure. 

Use of Measures of Reliability 

Measurements defining the sampling errors to which given 
statistical constants are subject are put to various uses. 
It is in order now briefly to review the standard errors of 
different statistical measurements, and to illustrate their
applications. 

Sampling Errors: The Mean 
For the standard error of an arithmetic mean we have

$$\sigma_M = \frac{\sigma}{\sqrt{N}}$$

where the symbol $\sigma$ in the numerator of the right-hand term
refers to the standard deviation of the population from
which the sample is drawn and N is the number of observations
in the sample. Actually, of course, we do not know
the standard deviation of the population, but we use as
an approximation to it the standard deviation of the
sample. The approximation is acceptable except when the
number of observations in the sample is small, in which
case special treatment is needed.¹

Reference has been made above to the fact that a dis- 
tribution of arithmetic means computed from random 
samples of a given population usually follows the normal 
law of error. This is true even though the distribution 
of the population from which the samples are drawn is 
not itself normal. Accordingly, we may interpret given
values of $\sigma_M$ with reference to the probabilities associated
with deviations in a normal distribution.

¹ See Chapter XVIII.

Table 34 in Chapter V shows the distribution in 1933 
of 11,404 workers in open hearth steel furnaces, classified 
according to their average hourly earnings. The arithmetic 
mean of this distribution is 50.14 cents; the standard 
deviation, which we may here represent by s, is 18.685 cents. 
Accepting this standard deviation as an approximation to 
the standard deviation of the population from which this 
sample was drawn,¹ we have

$$\sigma_M = \frac{s}{\sqrt{N-1}} = \frac{18.685}{\sqrt{11403}} = .175.$$


The true mean of the hourly earnings of wage workers 
in open hearth furnaces in 1933 is not known. The figure 
50.14 cents is our best approximation to it. If we should 
draw many samples, each the size of the one we have here, 
we should have many mean values normally distributed 
and centering, we may assume, at the true value. The 
standard deviation of this normal distribution we estimate
as .175 cents. Knowledge of this standard deviation, or
standard error, enables us to set limits within which it is
highly likely that the true mean lies. Any statements we
make about the true mean are to be interpreted with
reference to this figure.

¹ The formula for the standard error of the mean, when the $\sigma$ of the
population is known, is given by $\sigma_M = \sigma/\sqrt{N}$. When the standard deviation of the
population is replaced by that of the sample (s), as an approximation to the
desired quantity, the formula for $\sigma_M$ may be written

$$\sigma_M = \frac{s}{\sqrt{N}}, \quad \text{or} \quad \sigma_M = \frac{s}{\sqrt{N-1}}.$$

The first of these is appropriate if s has been derived from the relation
$s = \sqrt{\Sigma d^2/(N-1)}$ (where d is the deviation of a single observation from the
mean); the second is appropriate if s has been derived from the relation
$s = \sqrt{\Sigma d^2/N}$. In other words, N should be reduced by 1 either in the derivation
of s or in the derivation of $\sigma_M$. If $\sigma_M$ is derived from the d's of the original data,
the single operation is summed up in Bessel's formula

$$\sigma_M = \sqrt{\frac{\Sigma d^2}{N(N-1)}}.$$

(See Whittaker and Robinson, Calculus of Observations, London, Blackie &
Son, 1924, 205-200.) The reason for the reduction of N is discussed in Chapters
XV and XVIII, in dealing with “degrees of freedom.”
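
That the reduction of N by 1 may be made at either stage, with the same result, can be shown on a small set of figures. The five observations below are hypothetical. The three functions correspond to the two forms of the standard error and to Bessel's formula.

```python
import math
import statistics


def se_with_s_from_n_minus_1(data):
    """s derived with N - 1 in its denominator, then divided by sqrt(N)."""
    return statistics.stdev(data) / math.sqrt(len(data))


def se_with_s_from_n(data):
    """s derived with N in its denominator, then divided by sqrt(N - 1)."""
    return statistics.pstdev(data) / math.sqrt(len(data) - 1)


def se_bessel(data):
    """Single operation: sqrt(sum of squared deviations / (N (N - 1)))."""
    n = len(data)
    mean = statistics.fmean(data)
    sum_sq = sum((x - mean) ** 2 for x in data)
    return math.sqrt(sum_sq / (n * (n - 1)))


observations = [42.0, 47.5, 50.0, 55.5, 61.0]   # hypothetical hourly earnings, cents
print(se_with_s_from_n_minus_1(observations))
print(se_with_s_from_n(observations))
print(se_bessel(observations))
```

The three printed values agree, apart from rounding in the last decimal places.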

We might, for example, on the basis of these results, 
make the flat statement: The true mean of the population 
lies between 49.965 cents and 50.315 cents. (The first of
these limits is the sample mean minus one standard error;
the second is the sample mean plus one standard error.)
We may not assert that this statement is certainly true. 
It may be true or false. But if we continue indefinitely to 
draw samples from the population in question, computing 
the mean of each and the standard error of that mean, and 
if we make a statement about each similar to that made 
above, 68 out of 100 such statements will be true. (The 
actual numerical limits set by the different statements will 
differ, of course.) 

It is possible to vary the statement according to the 
degree of probability we wish to work with. Thus we might 
say: The true mean of the population lies between 49.80 
cents and 50.48 cents.¹ Of an indefinitely large number
of such statements, each based on the study of a sample 
similar to the one before us, we know that 95 out of 100 
would be true. This is the kind of knowledge we have 
about generalizations based on results obtained from samples. 
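
The limits quoted in the two preceding statements follow directly from the sample figures. The sketch below uses the values given in the text (mean 50.14 cents, s = 18.685 cents, N = 11,404); the function names are introduced here only for illustration.

```python
import math


def standard_error(s, n):
    """Standard error of the mean, s having been computed with N in its denominator."""
    return s / math.sqrt(n - 1)


def limits(mean, se, k):
    """Limits lying k standard errors below and above the sample mean."""
    return mean - k * se, mean + k * se


MEAN, S, N = 50.14, 18.685, 11404      # figures quoted in the text

se = standard_error(S, N)
one_se = limits(MEAN, se, 1.0)         # true for about 68 statements in 100
two_se = limits(MEAN, se, 1.96)        # true for about 95 statements in 100

print(f"standard error: {se:.3f}")
print(f"68 per cent limits: {one_se[0]:.3f} to {one_se[1]:.3f}")
print(f"95 per cent limits: {two_se[0]:.2f} to {two_se[1]:.2f}")
```

Widening the multiplier k raises the proportion of such statements that will be true, at the cost of a less precise statement.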

The essential facts concerning the mean of the present 
sample and its reliability may be summarized in the state- 
ment: The mean hourly earnings of wage workers in open 
hearth furnaces in 1933 was (in cents) 50.14 ± .175.^ 

*49.80 = 50.14 - (1.96 X .175) 

50.48 = 50.14 +■ (1.96 X .176) 

Nimety-tve per mnt of the area under a normal curve is Included within 

- , ^The measure of sampling reliability here given m the standard error. The 
t tWMillonal now I«! wmmonly followed, hm been to give tlie probable 
; ’ which k .6745 times the standard error. In the present example the 
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The staBdard error of a mean is frequently ' used, not 
merely as an abstract measure of sampling reliability, but 
as an instrument for testing a given hypothesis. Such 
an hypothesis usually involves an assumed parent popula- 
tion, and the test centers about the question whether a 
given sample could have been drawn from this parent 
population. Let us assume that, on rational grounds, we 
have set up the hypothesis that the mean duration of 
business cycles is five years. We have observations relating 
to 77 cycles occurring in various countries during stages 
of rapid industrialization.^ These cycles are distributed, in 
respect of duration, as follows: 


Duration of cycles,       Number of 
    in years                cycles 

        1                      3 
        2                     10 
        3                     22 
        4                     15 
        5                     12 
        6                      8 
        7                      2 
        8                      2 
        9                      2 
       10                      1 
                              -- 
      Total                   77 


The mean duration of these 77 cycles is 4.09 years, and 
the standard deviation of the distribution is 1.88 years. 
For the standard error of the mean we have 

$$\sigma_M = \frac{1.88}{\sqrt{77 - 1}} = .216.$$
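
The mean, the standard deviation, and the standard error of the mean can all be recovered from the frequency table. The following Python sketch is an illustration only. The variable names are ours, and the use of N - 1 as the divisor for the standard deviation is an assumption, adopted because it reproduces the printed figure of 1.88.

```python
import math

# Duration in years -> number of cycles, as tabulated above.
cycles = {1: 3, 2: 10, 3: 22, 4: 15, 5: 12, 6: 8, 7: 2, 8: 2, 9: 2, 10: 1}

n = sum(cycles.values())
mean = sum(years * count for years, count in cycles.items()) / n

# Sum of squared deviations about the mean, weighted by frequency.
squares = sum(count * (years - mean) ** 2 for years, count in cycles.items())

# Assumption: dividing by N - 1 reproduces the printed 1.88.
sd = math.sqrt(squares / (n - 1))

# Standard error of the mean, with the divisor used in the text.
se_mean = sd / math.sqrt(n - 1)

print(n, round(mean, 2), round(sd, 2), round(se_mean, 3))
```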


Are these results consistent with the hypothesis that our 
sample of 77 cycles is drawn from a parent population 
(i.e., a universe of cycles generated under similar conditions) 
with a mean duration of five years? 

probable error of the mean is .118 cents. It is well, in any case, to specify 
the exact measure of reliability being used. 

¹ See W. C. Mitchell, Business Cycles, the Problem and Its Setting, New 
York, National Bureau of Economic Research, 1927, 412-416; F. C. Mills, 
“An Hypothesis Concerning the Duration of Business Cycles,” Journal of 
the American Statistical Association, December, 1926, Vol. 21, 447-457. 

If we use $M$ to represent the mean of the sample data, 
$M_h$ to represent the hypothetical mean of the universe, 
and $T$ to denote the deviation of our sample mean from 
the hypothetical mean, expressed in units of the standard 
error of the mean, we may write 

$$T = \frac{M - M_h}{\sigma_M} = \frac{4.09 - 5.00}{.216} = \frac{-.91}{.216} = -4.21.$$


The figure .216 is, according to our hypothesis, the standard 
deviation of a distribution of arithmetic averages the mean 
of which is 5.00. If we were drawing from such a 
distribution, the mean of our present sample would represent 
a departure of 4.21 standard deviations from the general 
mean. What is the probability of such a departure occurring 
merely as a result of chance? Consulting a table showing 
areas under the normal curve, we find that the area on 
one side of the mean, lying at a distance of 4.21 standard 
deviations or more from the mean, constitutes 1/100,000 
of the total area under the curve. In terms of probabilities, 
this means that there is only one chance in 100,000 that 
a member of the population represented by the normal 
curve will fall below the mean value by 4.21 standard 
deviations or more. This chance is so remote that we say 
the event in question could not occur. With reference to 
the present problem, we conclude that the results are not 
consistent with the hypothesis. We could not have secured 
the sample values in question had we been drawing from 
a universe of cycles with a mean duration of five years. 
The results fail to confirm the theory we have set up. 
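
The test just described can be sketched in a few lines of Python. This is an illustration under the normal assumption of the text, not the author's own procedure: the function name `normal_deviate` is ours, and `NormalDist` replaces the printed table of areas.

```python
from statistics import NormalDist

def normal_deviate(sample_mean, hypothetical_mean, std_error):
    """Deviation of the sample mean from the hypothetical mean,
    in units of the standard error of the mean (T in the text)."""
    return (sample_mean - hypothetical_mean) / std_error

# Hypothesis: mean duration of business cycles is five years.
t = normal_deviate(4.09, 5.00, 0.216)

# Area under the normal curve lying below t (one side only).
one_tail = NormalDist().cdf(t)

print(round(t, 2), one_tail)
```

The one-sided area comes out at a little more than one chance in 100,000, in agreement with the rounded figure read from the table.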

The probability cited (1/100,000) relates to a deviation 
on one side of the hypothetical value only. If we wish to 
know the probability of an observation departing from the 
hypothetical mean value 5.00 by 4.21 standard deviations 
or more, without reference to whether the departure be 
above or below the hypothetical value, we must double 
the above probability. The chance of such a departure in 
one or the other direction is 2/100,000. Tests of hypotheses 
usually take this latter form. It is customary to ask whether 
a deviation of a stated magnitude could occur, and to 
measure the probabilities involved with reference to devia- 
tions in both directions. 

In using tables of the normal probability integral in 
tests of this type we are generally concerned with the 
probability of occurrence of deviations as great as or greater 
than some stated value (in the above example, .91 years, 
or 4.21 standard deviations). This probability is repre- 
sented by areas in the two tails of a normal curve (assuming 
that deviations either above or below the mean are in 
question). The inside limits of these segments are set 
by ordinates erected at distances from the mean equal 
to the deviation in question; the outside limits are at 
infinity. (See Fig. 85, in Chapter XIII, for a graphic 
representation of segments lying beyond stated limits.) The 
usual tables of the probability integral define the areas 
falling within limits set by ordinates at specific points. 
Our concern is with areas beyond these ordinates. Sub- 
traction of the internal area from the whole area (unity) 
will, of course, give the area of the external portion defining 
the probability that is here desired.¹ 

If we should be testing the hypothesis that the mean 
duration of business cycles is four years, we derive the 
value of T as follows: 

$$T = \frac{4.09 - 4.00}{.216} = \frac{.09}{.216} = .42.$$

¹ See W. Edwards Deming and Raymond T. Birge, “On the Statistical 
Theory of Errors,” Reviews of Modern Physics, Vol. 6, July, 1934, 138ff., for a 
discussion of this probability, which they designate $P_u$, and tests based on it. 
(In their terminology, u is the difference between the mean of the sample 
and the mean of the assumed population.) This article includes a chart (134) 
for use in determining the significance of a given deviation. 

From the tabulated values of areas under the normal curve 
we determine that approximately 67 per cent of all the 
observations in a normal distribution will deviate from 
the mean value by .42 standard deviations or more. We 
interpret this to mean that if our sample of 77 observations 
were drawn from a universe with a mean value of 4.00 
years, the chances are 67 out of 100 that the mean of the 
sample would depart from the population mean by .09 years 
or more. (We have counted the combined probabilities 
of deviations above and below the population mean.) In 
other words, a deviation as great as the one we have experi- 
enced is highly probable. The results are not inconsistent 
with the hypothesis that the mean duration of business 
cycles is 4.00 years. They do not, be it noted, prove the 
hypothesis. All that we may say of statistical evidence, 
on the positive side, is that it is not inconsistent with a 
given hypothesis. Supporting statistical evidence strengthens 
our confidence in the hypothesis, of course. Its tenability 
must be determined on the basis of rational considerations 
as well as empirical evidence. 
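
The subtraction of the internal area from unity, described above, may be sketched as follows. The function names `area_within` and `area_beyond` are ours, and the computed normal distribution of the Python standard library is assumed to be an acceptable substitute for the printed table.

```python
from statistics import NormalDist

def area_within(t):
    """Area under the normal curve between ordinates at -t and +t."""
    z = NormalDist()
    return z.cdf(abs(t)) - z.cdf(-abs(t))

def area_beyond(t):
    """Combined area in the two tails, beyond -t and beyond +t."""
    return 1.0 - area_within(t)

# Four-year hypothesis: T = .42.
print(round(area_beyond(0.42), 2))

# Five-year hypothesis: T = 4.21, deviations in both directions.
print(area_beyond(4.21))
```

For the five-year hypothesis the two-sided figure is roughly double the one-sided probability, as the text requires.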

This last point deserves emphasis. “The significance of 
each test,” say Deming and Birge,* “depends not only on 
the value of P (i.e., the measure of probability appropriate 
to the test) that is found, but also on how much is known 
a priori regarding the parent population.” The above 
hypothesis of a four-year cycle has no particular rational 
basis (the figure was used here, of course, to exemplify 
a procedure). The fact that the observed results are not 
inconsistent with it is significant in a negative way, but 
does not establish the truth of the hypothesis. Low values 
of P, indicating that the facts are inconsistent with given 
hypotheses, are highly useful in leading us to reject tentative 
formulations of theory. Acceptable values of P, however, 
need the support of other knowledge (a priori and empirical) 
concerning the body of materials being studied and the 
regularities prevailing therein. Within the limit of acceptable 
values, indeed, we may accept one hypothesis, rather than 
another for which empirical tests yield a higher value of 
P, because the former is more consistent with the general 
body of existing knowledge concerning the field in question. 

* Loc. cit., 137. 

In the two tests we have applied, no difficulty was encoun- 
tered in interpreting the probabilities bearing on the relation 
between the hypothetical mean and the observed facts. 
In the one case the odds were so small as to leave no doubt 
as to the lack of agreement; in the other case the difference 
was clearly insignificant. But many tests will lie on the 
borderline, and we must have some reasonable criterion 
as to the limit of significance. Odds of 1 out of 100 constitute 
one conventional standard. If a given difference between 
hypothetical and observed values would occur as a result 
of chance only 1 time out of 100, or less frequently, we may 
say that the difference is significant. This means that the 
results are not consistent with the hypothesis we have 
set up. If the discrepancy between theory and observation 
might occur more frequently than 1 time out of 100 solely 
because of the play of chance, we may say that the difference 
is not clearly significant. The results are not inconsistent 
with the hypothesis. The value of T (the difference between 
the hypothetical value and the observed mean, in units 
of the standard error of the mean) corresponding to a 
probability of 1/100 is 2.576. One hundredth part of the 
area under the normal curve lies at a distance from the 
mean, on the x-axis, of 2.576 standard deviations or more. 
Accordingly, tests of significance may be applied with direct 
reference to T, interpreted as a normal deviate (i.e., as a 
deviation from the mean of a normal distribution expressed 
in units of the standard deviation). A value for T of 2.576 
or more indicates a significant difference, while a value of 
less than 2.576 indicates that the results are not inconsistent 
with the hypothesis in question. 

There is, of course, nothing rigid about this particular 
standard. Some statistical workers employ odds of 1 out 
of 20 as a limit, rather than 1 out of 100. With this standard 
we would accept as significant (i.e., not due to chance) a 
difference between hypothetical and observed values that 
would occur only 5 times out of 100, or less frequently, as 
a result of random fluctuations of sampling. The value 
of T corresponding to this standard is 1.96. The standards 
of significance actually employed by a research worker may 
well vary from problem to problem. The investigator uses 
the results of these tests of significance as aids in the 
interpretation of his results and in the development of a body of 
theory that is not inconsistent with the evidence provided 
by experience. In the interplay of deduction and induction 
that marks such a process, no single absolute standard 
for the rejection or acceptance of hypotheses would be 
appropriate. 
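
The two conventional standards, and the values of T that correspond to them, can be obtained from the normal curve directly. The sketch below is illustrative: `critical_t` and `is_significant` are hypothetical helper names, and the default of 1 out of 100 simply mirrors the first standard discussed above.

```python
from statistics import NormalDist

def critical_t(odds):
    """Normal deviate beyond which, counting both tails together,
    only the stated proportion of the area lies (e.g. odds = 0.01)."""
    return NormalDist().inv_cdf(1 - odds / 2)

def is_significant(t, odds=0.01):
    """True when |T| reaches the limit set by the chosen odds."""
    return abs(t) >= critical_t(odds)

print(round(critical_t(0.01), 3))   # standard of 1 out of 100
print(round(critical_t(0.05), 2))   # standard of 1 out of 20
```

A borderline value of T may thus be significant by one standard and not by the other, which is why the standard in use should always be stated.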

The formula for the standard error of a mean, as given 
above, relates to a sample chosen by random selection. 
For a proportionately stratified sample the standard error 
of the mean, $\sigma_{m'}$, may be derived from the relation 

$$\sigma_{m'}^2 = \sigma_m^2 - \frac{\sigma_a^2}{N}$$

where $\sigma_m$ is the standard error of the same mean as it would 
have been had the N observations been taken at random 
from the universe of inquiry, and $\sigma_a$ is the standard deviation 
of the averages of the several strata about the average 
of the whole sample.¹ In computing $\sigma_a$, the deviation of 
the mean of each stratum is weighted in proportion to the 
number of cases in that stratum. N is the total number 
of observations in the sample. It is clear from the formula 
that the standard error of the mean of the stratified sample 
is smaller than the standard error of a corresponding random 
sample. 
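
The effect of stratification may be illustrated numerically. The sketch below takes the relation in the form: squared standard error of the stratified mean = squared standard error of the random-sample mean, less the weighted variance of the strata averages divided by N. The three strata are hypothetical figures invented for the illustration, the function names are ours, and root-mean-square deviations (divisor N) are assumed throughout.

```python
import math

def se_random(values):
    """Standard error of the mean for a simple random sample."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return math.sqrt(variance / n)

def se_stratified(strata):
    """Standard error of the mean of a proportionately stratified sample."""
    values = [v for stratum in strata for v in stratum]
    n = len(values)
    grand_mean = sum(values) / n
    # Variance of the strata averages about the average of the whole
    # sample, each stratum weighted by its number of cases.
    var_a = sum(
        len(s) * (sum(s) / len(s) - grand_mean) ** 2 for s in strata
    ) / n
    return math.sqrt(se_random(values) ** 2 - var_a / n)

# Hypothetical strata (not from the text).
strata = [[40, 42, 44], [50, 52, 54, 56], [60, 62, 64]]
all_values = [v for s in strata for v in s]
print(se_random(all_values), se_stratified(strata))
```

The more the strata averages differ from one another, the larger is the term subtracted, and the greater the gain from stratification.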

¹ The above formula is from A. L. Bowley, Elements of Statistics, London, 
King, sixth ed., 1937. 

SAMPLING ERRORS: MEDIAN AND QUARTILES 

The median is subject to greater sampling fluctuations 
than is the mean. The degree of dispersion of median 
values derived from a number of samples of a stated size 
from a given population will be approximately 25 per cent 
greater than the dispersion of the arithmetic means of 
the same samples. More exactly, we have 

$$\sigma_{Md} = 1.25331 \, \frac{\sigma}{\sqrt{N - 1}}$$

Estimates of the quartiles, in turn, are less accurate than 
are estimates of the median. For these we have 


$$\sigma_{Q} = 1.36263 \, \frac{\sigma}{\sqrt{N - 1}}$$
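
The constants 1.25331 and 1.36263 are not arbitrary. For a normally distributed universe they can be derived from the ordinate of the normal curve at the median and at the quartile. The sketch below uses the usual large-sample expression for the sampling error of a quantile, which is our assumption about the origin of the printed constants; the function names are ours.

```python
import math
from statistics import NormalDist

def quantile_constant(p):
    """Ratio of the standard error of the p-th quantile to the
    standard error of the mean, for a normal parent distribution."""
    z = NormalDist().inv_cdf(p)          # position of the quantile
    ordinate = NormalDist().pdf(z)       # height of the curve there
    return math.sqrt(p * (1 - p)) / ordinate

def se_median(sd, n):
    """Standard error of the median, with the divisor used in the text."""
    return quantile_constant(0.5) * sd / math.sqrt(n - 1)

print(round(quantile_constant(0.50), 5))   # median
print(round(quantile_constant(0.25), 5))   # quartiles
```

The ratio for the median, about 1.25, is the source of the statement that medians fluctuate some 25 per cent more than arithmetic means.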


SAMPLING ERRORS: STANDARD DEVIATION 

In determining the magnitude of the sampling errors to 
which the standard deviation is subject we must distinguish 
between samples drawn from a normally distributed universe 
and those derived in the more general case, in which the na- 
ture of the distribution of the universe is unknown. If the 
distribution of the universe is normal we have, as the 
estimated standard error of $\sigma$, 

$$\sigma_\sigma = \frac{s}{\sqrt{2N}}$$

(where N - 1 has been used in the computation of s). Thus, 
for the universe of residential telephone subscribers 
represented by the distribution in Table 109, we have 

$$\sigma_\sigma = \frac{147.7}{\sqrt{1{,}990}} = 3.31.$$

The more general formula for the standard error of the 
standard deviation involves the fourth as well as the second 
moment of the distribution: 


$$\sigma_\sigma = \sqrt{\frac{\mu_4 - \mu_2^2}{4 \mu_2 N}}$$

For the distribution based on hourly earnings in open-hearth 
steel furnaces in 1933 the standard deviation was 18.685 
cents (see Table 34). As the standard error of this 
measurement we have* 

$$\sigma_\sigma = \sqrt{\frac{\mu_4 - \mu_2^2}{4 \times 13.9674 \times 11{,}404}} = .0432.$$

Since the moments here employed are in class-interval 
units, the derived measurement is also in those terms. In 
the original units we have 

$$\sigma_\sigma = .0432 \times 5 \text{ cents} = .2160 \text{ cents}.$$

Many tests of significance involve the use of standard 
deviations and corresponding measurements of sampling 
reliability. These are discussed more fully in the chapter 
on the analysis of variance. 
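
The two formulas for the standard error of the standard deviation are easily compared in code. The sketch below is illustrative; the function names are ours, and the sample size of 995 for the telephone example is inferred from the printed divisor (the square root of 1,990, i.e. of 2N) rather than quoted from Table 109.

```python
import math

def se_sd_normal(s, n):
    """Standard error of the standard deviation, normal universe."""
    return s / math.sqrt(2 * n)

def se_sd_general(mu2, mu4, n):
    """General formula, using the second and fourth moments
    about the mean (mu2 and mu4)."""
    return math.sqrt((mu4 - mu2 ** 2) / (4 * mu2 * n))

# Telephone example: s = 147.7, 2N = 1,990 (N inferred as 995).
print(round(se_sd_normal(147.7, 995), 2))
```

When the fourth moment equals three times the square of the second, as it does for a normal distribution, the general formula reduces to the simpler one.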


SAMPLING ERRORS: COEFFICIENT OF CORRELATION 

A number of distinctive problems are faced in generalizing 
the results of correlation studies and in determining the 
significance of the measurements secured in such studies. 
Certain of these problems are discussed in the succeeding 
chapter, and Chapter XVIII deals with important limita- 
tions that are faced when the samples employed are small. 
At this point general methods of measuring the reliability 
of correlation measurements are presented, without certain 
of the qualifications that will be discussed later. 

As a basic formula for the sampling error of the coefficient 
of correlation computed from N pairs of observations, we 
have 

$$\sigma_r = \frac{1 - r^2}{\sqrt{N - 1}}$$

where r, in the numerator of the right-hand member, is 
the true coefficient of correlation in the population at large. 
Since we do not know the true r we must use the r of the 
sample as an estimate of the required value. This formula 
may be taken to hold for distributions approaching the 
normal type, when the number of cases included in the 
sample is fairly large, say 50 or more. When the sample 
is small and, particularly, when we are dealing with a 
relatively high coefficient of correlation derived from a 
small sample, the standard error secured from the formula 
cited above may be faulty, and tests of significance based 
on it misleading. The reason for this and means of meeting 
the difficulty are discussed in Chapter XVIII. 

* Since Sheppard's corrections are not appropriate to this distribution, the 
uncorrected moments are used. 

In exemplifying the application of the usual test, we 
may employ results presented in Chapter X, on the relation 
between the discount rates of Federal Reserve banks and 
of commercial banks. The value of r is + .84, while N 
equals 1,800. Accordingly, we have 


$$\sigma_r = \frac{1 - (.84)^2}{\sqrt{1{,}800 - 1}} = \frac{.2944}{42.40} = .007.$$


The standard error of r is frequently used, as are similar 
measurements relating to other statistical constants, to test 
hypotheses. We may put such a question as the following: 
Is the value of r secured from a given sample significant 
of a real relationship between the variables in question 
in the population from which the sample was drawn? 
Putting the question in form more appropriate for testing: 
Is the present value of r consistent with the hypothesis 
that there is no relationship between the variables in 
question in the population at large? R. A. Fisher terms 
such an hypothesis a “null hypothesis.” The purpose of 
experiment, in his words, is to give the facts a chance of 
disproving the null hypothesis. 

In a study of the movements of commodity prices, 1,202 
measurements were secured on the timing of advances in 
the prices of individual commodities during periods of 
general business revival. Paired with each measurement 
was a similar observation on the timing of the decline in 
the price of the given commodity during the succeeding 
period of general business recession.¹ We desire to know 
whether there is any relation between the sequence of price 
revival and the sequence of price recession. Is there a 
pattern in price movements during business cycles? Evidence 
of the existence of such a persistent pattern would lend 
support to the view that cycles represent true regularities 
in economic life. 

These 1,202 pairs of observations yield a correlation 
coefficient of + .27. This does not show a pronounced 
degree of relationship. Our chief concern, however, is not 
with the magnitude of r. We wish to know whether the 
result is consistent with the hypothesis that the true 
correlation is zero. For the standard error of r we have 

$$\sigma_r = \frac{1}{\sqrt{1{,}202 - 1}} = .029.$$


By hypothesis, the population value of r is zero, so the 
numerator of the fraction is 1. 

If the true value of r were zero, and the standard error 
of r were .029, what would the probability be that, as a 
result of chance, we should secure a coefficient of + .27 from 
a given sample? Since this value represents a departure 
of more than 9 σ's from the hypothetical value of zero, 
the probability that the difference is due to chance is 
vanishingly small. We conclude that the results are not 
consistent with the hypothesis that the sequence of price 
change during revival is unrelated to the sequence of decline 
in a succeeding recession. The null hypothesis is disproved. 

Had the value of T (in this case $T = \frac{.27 - 0}{.029}$) been less 
than 2.576 the conclusion would of course have been 
different. In such a case the discrepancy between the sample r 
and the hypothetical value of zero could be attributed to 
sampling fluctuations. The result would not be inconsistent 
with the null hypothesis. 

¹ The Behavior of Prices, New York, National Bureau of Economic Research, 
1927, 131. 

Having established that the results are not consistent 
with the hypothesis that the true value of r is zero, we may 
compute the standard error of r as actually derived. Assuming 
now that the sample is drawn from a parent population 
in which r = + .27, we have 

$$\sigma_r = \frac{1 - (.27)^2}{\sqrt{1{,}202 - 1}} = .027.$$
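
The several computations involving the standard error of r may be gathered into one short sketch. It assumes the formula in the form (1 - r²) divided by the square root of N - 1, which reproduces the printed results; the function name `se_r` is ours.

```python
import math

def se_r(r_assumed, n):
    """Standard error of the coefficient of correlation, given the
    value of r assumed for the population and N pairs of observations."""
    return (1 - r_assumed ** 2) / math.sqrt(n - 1)

# Discount rates: r = +.84, N = 1,800.
se_discount = se_r(0.84, 1800)

# Price movements: r = +.27, N = 1,202.
r_sample, n = 0.27, 1202
se_null = se_r(0.0, n)              # numerator is 1 under the null hypothesis
t = (r_sample - 0.0) / se_null      # deviation in units of the standard error
se_at_sample_r = se_r(r_sample, n)  # after the null hypothesis is rejected

print(round(se_discount, 3), round(se_null, 3), round(se_at_sample_r, 3))
```

The same function serves both purposes: with an assumed r of zero it gives the error appropriate to the test, and with the sample r it gives the error to be reported with the coefficient.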


SAMPLING ERRORS: INDEX OF CORRELATION 

The standard error of the index of correlation may be 
approximated from the relation 

$$\sigma_\rho = \frac{1 - \rho^2}{\sqrt{N - m}}$$

In this formula m represents the number of constants in 
the equation of regression. In the example cited in Chapter 
XII, relating to alfalfa yield and depth of irrigation water, 
$\rho$ is .80, N is 44, and m has a value of 3. We have, thus, 

$$\sigma_\rho = \frac{1 - (.80)^2}{\sqrt{44 - 3}} = .056.$$

The use and interpretation of this measure are analogous 
to those of $\sigma_r$. In the present instance the index of correlation 
is clearly significant.¹ 


¹ See Ezekiel, M., Methods of Correlation Analysis, N. Y., John Wiley and 
Sons, 1930, 257-258, for a discussion of the sampling reliability of the index of 
correlation. 

SAMPLING ERRORS: THE TEST FOR LINEARITY 

As a test for linearity we have been given 

$$\zeta = \eta^2 - r^2.$$

But we wish to know whether, in a given case, the difference 
between $\eta^2$ and $r^2$ may be due merely to a chance fluctuation 
of sampling, or to a real departure of the underlying 
relationship from the linear form. As the standard error of $\zeta$ 
Blakeman has proposed 

$$\sigma_\zeta = 2\sqrt{\frac{\zeta}{N}} \, \sqrt{(1 - \eta^2)^2 - (1 - r^2)^2 + 1}.$$

The use of this measure may be illustrated with reference 
to the problem relating to wheat yield which was considered 
in an earlier chapter. For the relation between wheat 
yield and amount of nitrogen used as fertilizer, we had 

$$r = +.793, \qquad \eta = .965, \qquad N = 193.$$

(The uncorrected value of $\eta$ should be used here.) 

Therefore 

$$\zeta = .302.$$

Inserting the given values in the formula for $\sigma_\zeta$ and solving, 
we have 

$$\sigma_\zeta = .074.$$

With $\zeta$ having a value of .302, about 4.08 times its standard 
error, there can be no question as to the non-linearity of 
the relationship. The difference between $\eta^2$ and $r^2$ is one 
which could hardly be due to chance fluctuations of 
sampling. 
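
The wheat-yield computation may be checked with the following sketch. It takes Blakeman's standard error in the form 2√(ζ/N) multiplied by √((1 - η²)² - (1 - r²)² + 1), which reproduces the printed .074; the function name `linearity_test` is ours.

```python
import math

def linearity_test(r, eta, n):
    """Return (zeta, standard error of zeta) for the test of linearity."""
    zeta = eta ** 2 - r ** 2
    se_zeta = (
        2 * math.sqrt(zeta / n)
        * math.sqrt((1 - eta ** 2) ** 2 - (1 - r ** 2) ** 2 + 1)
    )
    return zeta, se_zeta

# Wheat yield and nitrogen: r = +.793, eta = .965, N = 193.
zeta, se_zeta = linearity_test(0.793, 0.965, 193)
print(round(zeta, 3), round(se_zeta, 3), round(zeta / se_zeta, 1))
```

The ratio computed from unrounded figures differs slightly in the second decimal from the 4.08 obtained with the rounded values .302 and .074; the conclusion is unaffected.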

The criterion $\eta^2 - r^2$ is not very satisfactory as a test 
of linearity, since the distribution of $\zeta$ does not follow the 
normal law. The same weakness attaches to the 
correlation ratio. As Fisher has demonstrated, the distribution of 
$\eta$ does not tend to normality, even with large samples, 
unless the number of arrays is increased without limit. 
Accordingly, the standard error of $\eta$ is of dubious utility. 
More efficient methods of testing for the existence of 
correlation, and for linearity, are discussed in Chapter XV. 

SAMPLING ERRORS: COEFFICIENT OF RANK CORRELATION 

The standard error of the coefficient of rank correlation 
has been given by “Student” as 

$$\sigma_{\rho_r} = \frac{1}{\sqrt{N - 1}}$$

It is notable that this value is independent of the true 
value of $\rho_r$.¹ This standard error may be taken to relate 
to a normal distribution, and interpreted in the familiar 
manner, when N is fairly large, say 45-50 or more. For 
small samples the distribution of $\rho_r$ is not normal. In the 
example cited in Chapter X, dealing with the relation 
between the number of individual income tax returns and 
the number of passenger automobiles registered in 1934, by 
states, we had $\rho_r$ = .94. Since there are 47 observations, 
the value of $\sigma_{\rho_r}$ is given by 

$$\sigma_{\rho_r} = \frac{1}{\sqrt{47 - 1}} = .147.$$


The sample is large enough to justify the assumption that 
the distribution of pr would approximate the normal type. 
The coefficient of rank correlation is clearly significant, 
being more than six times its standard error. 
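
A brief sketch of this computation follows. The function name is ours, and the threshold of 45 observations is simply the lower figure mentioned in the text as a rough guide, not a fixed rule.

```python
import math

def se_rank_correlation(n):
    """Standard error of the coefficient of rank correlation;
    it depends on the number of observations only."""
    return 1 / math.sqrt(n - 1)

n_states = 47
se = se_rank_correlation(n_states)
ratio = 0.94 / se                 # coefficient in units of its standard error
large_enough = n_states >= 45     # rough guide from the text

print(round(se, 3), round(ratio, 1), large_enough)
```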


SAMPLING ERRORS: COEFFICIENT OF REGRESSION 

High importance frequently attaches to the coefficient 
of regression, in dealing with relationships among variable 
quantities. For the standard error of this measurement we 
have² 

$$\sigma_b = \frac{s_y}{\sqrt{\Sigma x^2}}$$

where x is a given value of the independent variable, 
expressed as a deviation from the mean of that variable, 
and $s_y$ is the root mean square of the deviations of the 
actual values of y, the dependent variable, from the 
corresponding computed values. That is, $s_y$ is a measure of the 
scatter about the line of regression.³ 

¹ See Hotelling and Pabst, loc. cit. 

² See R. A. Fisher, Statistical Methods for Research Workers, Edinburgh, 
Oliver and Boyd, sixth edition, 1936, 134-146. 

A test involving the use of $\sigma_b$ may be applied to data 
relating to the average corn yield per acre in Kansas, by 
years, from 1890 to 1933 (see Table 128, Chapter XVI). 
These yields show a fairly consistent declining trend. A 
line of trend fitted to the figures for these 44 years is defined 
by the equation 

$$Y = 22.05 - .1074X$$

where Y denotes corn yield per acre and X denotes time, 
in years, with origin at 1889. We wish to know whether 
the coefficient of regression (i.e., the slope of the line of 
trend) represents a significant departure from zero. The 
hypothesis we are testing is, then, that the true value of 
the coefficient of regression, in the population from which 
this sample is drawn, is zero — that there has been no 
significant decline in corn yield in Kansas over the period 
in question.⁴ 

For $s_y$ we secure the value 6.70, for $\sqrt{\Sigma x^2}$ the value 
84.23. Accordingly 

$$\sigma_b = \frac{6.70}{84.23} = .0795.$$


³ $$s_y = \sqrt{\frac{\Sigma (y - y_c)^2}{N}}$$
where y denotes a given value of the dependent variable and $y_c$ denotes the 
corresponding value derived from the equation of regression. In the 
computation of $s_y$ for this purpose N must be reduced by the number of constants in the 
equation of regression. 

⁴ The hypothetical population of which we assume our sample to be 
representative is the population that would be generated by the forces responsible 
for variation in Kansas corn yields from 1890 to 1933, if those forces, 
unchanged, were to act upon an infinite number of cases. The application of 
such tests, and of the whole probability calculus, to data ordered in time 
involves some logical difficulties, which are discussed at a later point. 

We may denote by the symbol $\beta$ the coefficient of regression 
assumed in our hypothesis (in this case zero). We wish to 
know whether the deviation of our actual b from this 
hypothetical $\beta$ may be attributed to chance, or whether 
it is too great to be so explained. This deviation should 
be expressed in units of the standard error of b, in order 
that the probabilities underlying the normal distribution 
may be applied in our reasoning. Using T, as before, to 
denote the deviation in units of $\sigma_b$, we have 

$$T = \frac{b - \beta}{\sigma_b} = \frac{-.1074 - 0}{.0795} = -1.35.$$

The given value of b represents a departure of 1.35 
standard deviations from the mean value of zero in our 
hypothetical population. As may readily be determined by 
reference to the table of the probability integral, such a 
deviation might easily occur, as a result of chance alone. 
The results, then, are not inconsistent with the hypothesis. 
There is no clear evidence here of a significant decline in 
corn yield per acre in Kansas during the period covered. 
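
The Kansas test can be reproduced in a few lines. In the sketch below the sum of squared deviations of X is computed directly from the 44 yearly values (X = 1 to 44, origin at 1889), while the scatter figure 6.70 and the slope -.1074 are taken from the text as given. The variable names are ours, and the two-sided probability is read from the computed normal curve rather than from the printed table.

```python
import math
from statistics import NormalDist

years = range(1, 45)                      # X = 1 ... 44, origin at 1889
mean_x = sum(years) / len(years)
sum_x2 = sum((x - mean_x) ** 2 for x in years)

s_y = 6.70                                # scatter about the line of trend
b = -0.1074                               # observed coefficient of regression
beta = 0.0                                # value assumed in the hypothesis

se_b = s_y / math.sqrt(sum_x2)
t = (b - beta) / se_b

# Chance of a deviation this large or larger, in either direction.
p_two_sided = 2 * NormalDist().cdf(-abs(t))

print(round(math.sqrt(sum_x2), 2), round(se_b, 4), round(t, 2))
```

The two-sided probability is in the neighborhood of 18 chances out of 100, far above either conventional limit of significance.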

SAMPLING ERRORS: DIFFERENCE BETWEEN MEANS 

A problem of sampling that arises rather frequently is 
that of determining whether two samples could have been 
drawn from the same parent population. Obviously, there 
would be some difference between the means of two samples 
from the same universe, as there would be between standard 
deviations or coefficients of correlation secured from different 
sampling operations. We may illustrate the procedure 
employed in determining the significance of a difference 
between two arithmetic means. 

Reference has been made above to a sample of 77 business 
cycles, occurring during stages of rapid industrialization. 
Their mean duration was 4.09 years; the standard deviation 
of the distribution was 1.88 years. The same investigation 
indicated that the mean duration of 51 business cycles 
occurring in various economies during early stages of 
industrialization was 5.86 years, and that the standard 
deviation of these measurements was 2.41 years. There is 
an indication here that business cycles are accelerated, 
that their average length is shortened, when an economy 
is passing through a phase of rapid industrialization with 
corresponding impetus to technological change. In this 
case the null hypothesis against which we set our facts 
is that there is no difference, in respect of duration, between 
business cycles occurring in the two stages of industrialization 
named. 

The difference between two means is a statistical meas- 
urement subject to a definite law of distribution. If a 
great many pairs of samples were drawn from a given 
population, the value D (i.e., $M_1 - M_2$) could be computed 
from the two means of each pair. A frequency distribution 
of the D's thus secured would follow the normal law. The 
magnitude of the standard deviation of this distribution 
would be a function of the sizes of the samples thus paired 
and of the standard deviations of these samples. We may 
approximate the standard deviation of this distribution of 
D's from the relation 

$$\sigma_D = \sqrt{\sigma_{M_1}^2 + \sigma_{M_2}^2}$$

The measurement needed for testing the hypothesis now 
before us is computed from the relation 

$$\sigma_D = \sqrt{\frac{(1.88)^2}{77 - 1} + \frac{(2.41)^2}{51 - 1}} = \sqrt{.1627} = .4034.$$

The value of D, the difference between the two means, is 
5.86 - 4.09, or 1.77. This value of D is to be tested 
with reference to a hypothetical value of zero. Accordingly, 
for T (the discrepancy expressed as a normal deviate) we 
have 


$$T = \frac{1.77 - 0}{.4034} = 4.39.$$


This discrepancy far exceeds the magnitude 2.576, corre- 
sponding to odds of 1 out of 100. If the true value of D 
were zero, a discrepancy as great as this or greater would 
occur as a result of chance about 1 time out of 100,000 
trials. The results indicate that the difference between 
the two means is not due to chance. The facts are not 
consistent with the hypothesis that the two samples are 
drawn from the same population. There is a significant 
difference between the average durations of business cycles 
occurring in early stages of industrialization and in later 
stages of rapid industrial change. 
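
The comparison of the two groups of cycles may be sketched as follows. The standard error of the difference is taken as the square root of the sum of the squared standard errors of the two means, each computed with the divisor N - 1; that divisor is an assumption on our part, adopted because it reproduces the printed figure .1627. The function name is ours.

```python
import math

def se_difference(sd1, n1, sd2, n2):
    """Standard error of the difference between two sample means."""
    return math.sqrt(sd1 ** 2 / (n1 - 1) + sd2 ** 2 / (n2 - 1))

# Rapid industrialization: 77 cycles, mean 4.09, s.d. 1.88.
# Early industrialization: 51 cycles, mean 5.86, s.d. 2.41.
se_d = se_difference(1.88, 77, 2.41, 51)
d = 5.86 - 4.09
t = (d - 0) / se_d                # hypothetical difference is zero

print(round(se_d ** 2, 4), round(t, 2))
```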


SAMPLING ERRORS: DIFFERENCE BETWEEN PERCENTAGES 

There are occasions when it is desirable to determine 
whether a difference between two proportions (or 
percentages) is significant. Using $D_p$ to denote such a difference, 
we have 

$$\sigma_{D_p}^2 = p_0 q_0 \left(\frac{1}{N_1} + \frac{1}{N_2}\right)$$

where $p_0$ is the weighted mean proportion, $q_0$ is $1 - p_0$, and 
$N_1$ and $N_2$ are the total numbers of cases in the two samples 
to which the proportions relate.¹ (In computing this value 
and applying the corresponding test it is necessary to divide 
percentages by 100, to reduce them to the form of 
proportions or ratios.) 

A tabulation of American and foreign business cycles 
by Wesley C. Mitchell has indicated a relative 
preponderance of three-year cycles in American experience. Of 32 
American cycles 10, or 31.2 per cent, lasted 3 years; of 
134 cycles in other countries 20, or 14.9 per cent, lasted 
3 years.² Is the difference between these two percentages 
great enough to justify the inference that the forces acting 
upon American business differ from those acting abroad, 
creating a significantly higher percentage of three-year 
cycles? The hypothesis that we test in this case is that the 
difference is not significant, that the groups of American 
and foreign business cycles are drawn from the same 
universe. 

¹ See Hornell Hart, “The Reliability of a Percentage,” Journal of the 
American Statistical Association, Vol. 21, March, 1926. 

The two proportions, $p_1$ and $p_2$, with which we work are 
.312 and .149. The difference $D_p$ between the two 
proportions is .312 - .149 or .163. For the weighted mean 
proportion we have 

$$p_0 = \frac{N_1 p_1 + N_2 p_2}{N_1 + N_2} = \frac{(32 \times .312) + (134 \times .149)}{32 + 134} = .1804$$

$$q_0 = 1 - p_0 = .8196.$$

We compute the standard error of $D_p$ from the 
relationship shown above 

$$\sigma_{D_p}^2 = .1804 \times .8196 \left(\frac{1}{32} + \frac{1}{134}\right) = .005724$$

$$\sigma_{D_p} = .0757.$$

Between the given value of $D_p$ and the hypothetical value 
of zero we have the discrepancy (expressed as a normal 
deviate) 

$$T = \frac{.163 - 0}{.0757} = 2.15.$$

A discrepancy as great as this or greater might occur, 
as a result of chance, about 3 times out of 100. If our 
standard of significance is 1 out of 100 we must conclude that 
the difference between the two percentages is not clearly 
significant. The result is not inconsistent with the hypothesis 
we set out to test, that American and foreign business 
cycles are drawn from the same universe, in respect of 
the proportion of three-year cycles occurring. It is proper 
to say, however, that we are dealing with borderline results. 
If our standard of significance were 1 out of 20 we should 
consider the difference between American and foreign 
experience significant. Perhaps we should say that although 
the present evidence does not provide conclusive proof that 
the two samples come from different universes, there is 
indication of a difference between the forces affecting the 
relative frequency of three-year cycles in the United States 
and in foreign countries. Such results call for further 
research, in order that a more definite conclusion may be 
reached. 

² See Business Cycles, the Problem and Its Setting, N. Y., National Bureau of 
Economic Research, 1927. 
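
The borderline character of this result is easily seen when the computation is carried through in code. The sketch below follows the steps of the text, using the rounded proportions .312 and .149; the variable names are ours, and the two-sided probability is taken from the computed normal curve.

```python
import math
from statistics import NormalDist

p1, n1 = 0.312, 32      # American cycles lasting three years
p2, n2 = 0.149, 134     # cycles in other countries lasting three years

# Weighted mean proportion and its complement.
p0 = (n1 * p1 + n2 * p2) / (n1 + n2)
q0 = 1 - p0

# Standard error of the difference between the two proportions.
se_dp = math.sqrt(p0 * q0 * (1 / n1 + 1 / n2))

t = (p1 - p2 - 0) / se_dp
p_two_sided = 2 * NormalDist().cdf(-abs(t))

print(round(p0, 4), round(se_dp, 4), round(t, 2), round(p_two_sided, 2))
```

The probability falls between the limits of 1 out of 100 and 1 out of 20, so the verdict depends on which standard of significance is adopted.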


SAMPLING EHEORS AND SIGNIFICANT FIGITRES 

In deciding upon the number of figures to be recorded 
as significant, measures of sampling errors are, of course, 
pertinent. A useful general rule laid down by Truman L. 
Kelley follows: In a final published constant, retain no figures 
beyond the position of the first significant figure in one third 
of the standard error; keep two more places in all 
computations.† Its application may be illustrated with reference to 
the figures on hourly earnings of 11,404 steel workers in 
1933. The mean, to four places, is 50.1360 cents. The 
standard error of the mean is .175 cents. One third of this 
is .0583. The first significant figure is in the column of 
hundredths. By the rule, therefore, the arithmetic mean 
should be given as 50.14 cents. Two more places, or four 
decimal places in all, should be retained in calculations. 
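
The rule lends itself to a short sketch in Python. The function names below are hypothetical, and the use of a base-10 logarithm to locate the first significant figure is an assumption made for illustration; the rule itself is as stated above.

```python
import math

def kelley_decimal_places(standard_error):
    """Decimal place of the first significant figure in one third of the standard error."""
    third = standard_error / 3.0
    # floor(log10(x)) is the power of ten of the leading digit of x.
    return -math.floor(math.log10(third))

def kelley_round(value, standard_error):
    """Return the published value and the number of places to carry in computation."""
    places = kelley_decimal_places(standard_error)
    return round(value, places), places + 2

# Hourly earnings of steel workers: mean 50.1360 cents, standard error .175 cents.
published, working_places = kelley_round(50.1360, 0.175)
print(published, working_places)
```

For the steel workers' earnings this reproduces the published figure of 50.14 cents, with four decimal places carried in computation.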


† The rule here given is the Kelley suggestion as re-phrased by P. J. Rulon 
(Science, N. S. Vol. 84, No. 2,187, Nov. 27, 1936, 484). I have changed “one 
half the probable error” in Rulon’s statement to “one third of the standard 
error.” 
Some Limitations to Measures of Sampling Errors 

The importance of such measures of reliability as have 
been discussed above is, of course, great. With their aid 
we may give precision to our judgments concerning the 
margins of error involved in extending statistical results 
beyond the limits of actual observation. Yet limitations 
attach to them, and these must not be forgotten in a purely 
mechanical application of statistical tests. 

Reference has been made to limitations relating to the 
size of samples. In the interpretation of most measures of 
sampling errors the assumption is made that statistical 
measurements secured from successive samples are dis- 
tributed in accordance with the normal law of error. When 
the number of cases is large this is approximately true, even 
though the original data are not so distributed. But with 
a small number of cases in each sample this assumption 
may be quite invalid. The significance of given deviations 
(in terms of T) is therefore materially altered when we are 
dealing with results secured from small samples. Techniques 
have been developed, however, for defining sampling errors 
based on small samples. These are discussed at a later 
point (Chapter XVIII). 

Moreover, the conventional standard errors we have 
discussed can be assumed to measure only errors arising 
from the fluctuations of simple sampling. If there is to be 
full conformity to the conditions of simple sampling, the 
probability of a given event occurring must be the same 
in all parts of the universe being sampled and for all time 
periods included, and the individual events (i.e., drawings 
or observations) must be completely independent of one 
another. The fact that customary error formulas are 
strictly applicable only when these conditions have been met 
injects elements of doubt into many statistical inductions 
in the field of economics. We cannot always be sure that 
the conditions of simple sampling are actually fulfilled. 
They are rarely perfectly fulfilled in the handling of economic 
data. The standard errors derived above can give no 
indication of the possibility of fluctuations in successive 
samples due to causes other than those arising from simple 
sampling. Fluctuations due to bias, due to lack of 
representativeness in the sample, due to persistent errors of 
any sort, quite elude this method of determining probable 
stability. Although some degree of departure from the 
rigid conditions of perfect sampling does not deprive the 
measures of reliability of all value, the limitations noted 
must be the constant concern of the statistician. 

The element of time adds one serious difficulty to the 
problem of statistical induction in the realm of economics, 
and in the social sciences generally. A universe that extends 
over time is subject to elements of change that are not 
present among data relating to a cross-section of time. 
Conditions of pig iron production, of banking, of foreign 
trade, of income distribution change from year to year, 
even from month to month. We may hardly assume that 
data relating to different time periods reflect the play of 
identical forces. When we deal with data from different 
periods we are, as Oskar Anderson has pointed out, drawing 
from different universes. The structural changes that occur 
in economic organization are manifestations of this state 
of never-ending transition. Accordingly the homogeneity 
of all populations extending over time is suspect. In par- 
ticular are hazards faced when an induction extends to a 
time period not covered by the data of observation. 

The fitting of trend lines, and the use of deviations from 
trend in statistical analysis, represent one effort to overcome 
difficulties arising out of temporal change. It is assumed 
that variations due to trend reflect the deep-seated changes 
that would introduce elements of heterogeneity into the 
particular universe of inquiry, and that deviations from 
trend may be made the bases of statistical inference. The 
effects of some temporal changes are doubtless removed by 
this process. But the argument cannot justify the extension 
to a new time period of measures of sampling error based 
on the study of another period, unless it can be established 
that no essential change occurred in the conditions affecting 
the phenomena in question. The probable errors involved 
in such extension, without the validation noted, are not 
capable of definition. For this extension would involve 
generalizing about one universe from the study of an- 
other. 

In the application of statistical methods proper choice 
of objectives, wise planning, and effective field work are 
of at least equal importance with skill in the use of statistical 
techniques. This is especially true as regards problems 
of sampling. Here chief emphasis falls on soundness and 
accuracy in the field work. The problems of field work are 
specialized and particular, arising out of specific problems 
and conditions. Appropriate special knowledge is needed 
for the selection and validation of the sample. 

Much may be done to strengthen a statistical induction 
by making actual statistical tests of the homogeneity of 
the population and of the stability of sampling results. 
By the study of successive samples the representativeness 
of statistical measures may be determined; and by testing 
the subordinate elements of a given sample, when broken 
up into significant sub-groups, the inherent stability of a 
sample may be checked. The uniformity of nature in a 
given field is assumed in every induction. The induction is 
strengthened by every piece of evidence that supports the 
assumption. 
CHAPTER XV 


THE ANALYSIS OF VARIANCE 

The determination of degree of correlation between vari- 
ables involves, essentially, the comparison of measurements 
of variability. Thus, in the familiar equation 

$$r^2 = 1 - \frac{S_y^2}{\sigma_y^2}$$

we are comparing the dispersion about the fitted line of 
regression ($S_y^2$) with the dispersion about the mean of the 
y’s ($\sigma_y^2$). Again, if we work with the relation 

$$r^2 = \frac{\sigma_{y_c}^2}{\sigma_y^2}$$

we are comparing the dispersion of the computed values 
of y about the mean of the y’s ($\sigma_{y_c}^2$) with the dispersion 
of the original observations about the mean of the y’s ($\sigma_y^2$). 
It is logical thus to compare measurements of variation, 
in applying correlation technique, for the purpose of the 
investigator is usually to test an hypothesis concerning the 
forces responsible for variation in the dependent variable. 
He is usually seeking an associated factor which may, on 
some rational basis, be assumed to influence the fluctuations 
of the variable he is treating as dependent. R. A. Fisher 
has developed a procedure to employ in the study of correla- 
tion which is based explicitly upon the analysis of variance. 
We deal in this chapter with certain applications of the 
flexible and powerful instrument Fisher has forged. 

Comparison of Measures of Variability 

We deal first with a simple comparison of two groups, 
in respect of variability. The prices of preferred and 
common stocks, as quoted on the New York exchanges, may 
be compared, to determine whether they differ significantly 
in variability. Table 29, presented on a preceding page, 
showed the distribution of closing prices on July 25, 1936, 
of 66 preferred stocks, paying annual dividends of seven 
per cent. With this we may compare the distribution of a 
like number of common stocks selected at random from 
those for which prices were quoted on the New York Stock 
Exchange on July 25, 1936. The required values are given 
in Table 111. 


Table 111 

Comparison of Preferred and Common Stocks in Respect of Price Variation 

                    Degrees   Sum of       Mean square   Standard    Common         Natural
                    of        squares of   deviation     deviation   logarithm of   logarithm of
                    freedom   deviations   (variance)                standard       standard
                              from mean                              deviation      deviation
                    (n)                    $\sigma^2$    $\sigma$    $\log_{10}\sigma$   $\log_e\sigma$

Common stocks         65      99,327.28    1,528.112     39.09       1.59207        3.66590
Preferred stocks
  (seven per cent)    65      30,812.20      474.034     21.77       1.33786        3.08056

                                                                     Difference =   0.58534


Each distribution includes 66 observations. (It is not 
essential to this comparison that the number of observations 
in the two distributions be equal.) In computing the mean 
square deviation we divide the sum of the squared deviations 
from the mean by n, the number of degrees of freedom, 
which is here equal to one less than the total number of 
observations in each distribution, that is, to N - 1. (More 
is said below about the determination of number of degrees 
of freedom.) The standard deviation of the common stocks, 
39.09, is materially greater than the corresponding figure, 
21.77, for preferred stocks, but we cannot tell by inspection 
whether the difference is significant, or whether it merely 
reflects a fluctuation of sampling. A precise test may be 
made by using the coefficient z as a measure of the difference 
in variability. 

This coefficient is equal to the difference between the 
natural logarithms of the two standard deviations. That 
is 

$$z = \log_e \sigma_1 - \log_e \sigma_2.$$

It is to be noted that natural logarithms are to be employed. 
Common logarithms on the base 10 may be shifted readily 
to natural logarithms on the base e (2.71828) by using the 
factor 2.3026 as a multiplier. From the entries in the 
last column of Table 111 we derive .58534 as the value 
of z. 
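
As an illustration, the coefficient may be computed directly from the sums of squares in Table 111. The sketch below is in Python with hypothetical function names. It relies on the identity that the difference of the logarithms of two standard deviations equals half the logarithm of the ratio of the variances; because the table works from logarithms rounded to five places, the result agrees with .58534 only to about three decimals.

```python
import math

def mean_square(sum_of_squares, degrees_of_freedom):
    """Variance estimate: sum of squared deviations divided by degrees of freedom."""
    return sum_of_squares / degrees_of_freedom

def z_coefficient(variance_1, variance_2):
    """z = ln(sigma_1) - ln(sigma_2), written as half the log of the variance ratio."""
    return 0.5 * math.log(variance_1 / variance_2)

common_variance = mean_square(99327.28, 65)      # common stocks
preferred_variance = mean_square(30812.20, 65)   # seven per cent preferred stocks
z = z_coefficient(common_variance, preferred_variance)

# Shifting a common logarithm to a natural logarithm with the factor 2.3026.
natural_log_common_sd = 1.59207 * 2.3026
print(round(z, 4), round(natural_log_common_sd, 4))
```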

If common and preferred stocks were alike, with respect 
to the dispersion of their prices, and if we had sufficiently 
large samples so that sampling fluctuations did not affect 
the measures of variance, the value of z would be zero. 
Is the value we have derived consistent with the hypothesis 
that the true value of z is zero? Could sampling fluctuations 
alone account for a deviation as great as .58534 from a 
true value of zero? If the derived value of z is too great 
to be attributed to sampling fluctuations, the hypothesis 
that common and preferred stocks are alike, with respect 
to the dispersion of their prices, is untenable. 

To determine whether the derived value of z is consistent 
with the hypothesis that its true value is zero, we must 
know something about the distribution of values of z, if 
these were computed from many samples drawn under 
the same conditions. Fisher has shown that this distribution 
is normal, or effectively so, when the two distributions 
being compared both include a large number of observations. 
This is also true when the two distributions include only 
a moderate number of observations, but with $n_1$ and 
$n_2$ equal or nearly equal. The standard deviation of a 
distribution of z’s secured under these conditions, or the 
standard error of z, is a function of the two n’s. It may be 
derived from the relationship 

$$\sigma_z = \sqrt{\frac{1}{2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

where $n_1$ and $n_2$ are the number of degrees of freedom in 
the two distributions. 

In the present example $n_1$ and $n_2$ are both equal to 65; 
the standard error of z is equal to the square root of the 
reciprocal of 65. We have 

$$\sigma_z = \sqrt{.015385} = .124.$$

The test of the hypothesis that the true value of z is zero 
reduces, then, to the question whether a value of .58534 
is likely to be drawn from a normally distributed population 
with a mean value of zero and a standard deviation of 
.124. A value of .58534 represents a deviation of 4.72 
standard deviations from zero (i.e., $z/\sigma_z = 4.72$). A 
deviation as great as this occurs so seldom, in random sampling, 
that we may not accept the conclusion that the present 
value represents a chance deviation from zero. The result 
is not consistent with the hypothesis that the true value 
of z is zero. The dispersion of common stock prices is 
significantly greater than the dispersion of the prices of 
preferred stocks paying seven per cent dividends. 
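
The standard error and the resulting ratio may be checked with a few lines of Python. This is a sketch under the large-sample assumption stated in the text (z approximately normal); the function name is hypothetical.

```python
import math

def standard_error_of_z(n1, n2):
    """Square root of half the sum of the reciprocals of the degrees of freedom."""
    return math.sqrt(0.5 * (1.0 / n1 + 1.0 / n2))

z = 0.58534                       # from Table 111
se = standard_error_of_z(65, 65)  # both distributions have 65 degrees of freedom
ratio = z / se                    # deviation from zero in standard-error units
print(round(se, 3), round(ratio, 2))
```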

To exemplify a different condition, we may compare the 
dispersion of prices of preferred stocks paying six per cent 
and of preferred stocks paying seven per cent dividends. 
We have 64 quotations on the former, 66 quotations on 
the latter, both relating to closing prices on the New York 
Stock and Curb Exchanges on July 25, 1936. The figures 
are given in Table 112 on page 494. 

Table 112 

Comparison of Six Per Cent and Seven Per Cent Preferred Stocks 
in Respect of Price Variation 

                    Degrees   Sum of       Mean square   Standard    Natural
                    of        squares of   deviation     deviation   logarithm of   1/n
                    freedom   deviations                             standard
                    (n)       from mean    $\sigma^2$    $\sigma$    deviation

Seven per cent
  preferred stocks    65      30,812.2     474.034       21.77       3.08056        .0153846
Six per cent
  preferred stocks    63      28,175.0     447.222       21.16       3.05166        .0158730

                                                      Difference =   0.02890  Sum = .0312576


In this comparison the value of z is .02890. The standard 
error of z (the square root of half the sum of the two 
reciprocals) is .12502. The coefficient z deviates from zero by 
an amount equal to about one fourth of the standard error 
of z ($z/\sigma_z = .23$). This, of course, is a deviation that would 
occur very frequently in a normally distributed variate 
with mean value of zero. The result is, therefore, consistent 
with the hypothesis that the true value of z is zero. There 
is no significant difference between six per cent and seven 
per cent preferred stocks in respect of the dispersion of 
their quoted prices. 
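
The same calculation with unequal degrees of freedom can be sketched as follows. The code is illustrative only: the dictionary keys are hypothetical labels, and the normal probability (again assuming z is approximately normally distributed) is taken from the standard library's complementary error function.

```python
import math

# Reciprocals of the degrees of freedom (the 1/n column of Table 112).
reciprocals = {"seven per cent": 1.0 / 65, "six per cent": 1.0 / 63}

z = 0.02890
standard_error = math.sqrt(0.5 * sum(reciprocals.values()))
ratio = z / standard_error

# Two-tailed chance of a normal deviate at least this large.
probability = math.erfc(ratio / math.sqrt(2.0))
print(round(standard_error, 5), round(ratio, 2), round(probability, 2))
```

A deviation with so high a probability of arising by chance gives no ground for rejecting the hypothesis.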


The Testing of Variability between Classes 

The comparison of standard deviations provides a means 
of answering questions of another type. Measurements of 
changes in the average selling prices of products of manu- 
facturing industries may be used to exemplify the procedure. 
If we classify manufacturing industries into those producing 
perishable, semi-durable, and durable goods, and compute 
an average of changes occurring between 1929 and 1933 
in the selling prices of the products of each of these categories, 
we obtain the index numbers given in Table 113. 

Table 113 

Measurements of Average Changes in Selling Prices, 1929-1933, in 
Three Groups of Manufacturing Industries 

                                   No. of        Index of selling prices
Class of industry                  industries       1929        1933

Producing perishable goods             34            100        69.81
Producing semi-durable goods           26            100        66.41
Producing durable goods                25            100        78.96
All industries                         85            100        71.46


The average decline in prices was much less among durable 
manufactured goods than among goods of the other classes; 
semi-durable goods suffered the greatest loss. The range 
of variation among the three averages is considerable, but 
on the basis of the evidence here given we are not able to 
say whether the observed differences are due to chance, 
merely, or whether the prices of these several classes of 
goods were subject to the play of quite different forces, 
during the period here covered. An objective test is needed, 
before we may assume that the observed differences are 
significant. 

For the application of such a test we need a measure of 
variation which is independent of the principle of classifica- 
tion here employed. How much might a series of price 
relatives for 1933, on the 1929 base, be expected to vary 
as a result of the play of chance? (By “chance” we here 
mean the mass of causes unrelated to the factor of relative 
durability.) A measure of the strength of such causes is 
provided by the variation within the three classes we have 
set up. The method used in measuring the variation within 
these classes is indicated in Table 114 on page 496. 

Table 114 

Illustrating the Measurement of Variation within Classes 

        (1)                     (2)            (3)                  (4)
                                           Mean of price      Sum of squares of
Class of industry             No. of         relatives        deviations of individual
                            industries     (1933 on 1929      price relatives from
                                               base)          class mean

Producing perishable goods      34             69.81              6,464.0275
Producing semi-durable goods    26             66.41              3,375.1849
Producing durable goods         25             78.96              5,725.6916
All industries                  85                               15,564.9040


It will be understood that the deviations which, in 
squared form, enter into the sums in the last column are 
the differences between individual items and the means 
of the classes in which those items fall. Thus the relative 
measuring the average selling price of products of the meat 
packing industry in 1933 was 44.90 on the 1929 base. 
This industry falls in the perishable goods group. The 
difference between 44.90 and 69.81 is 24.91. The square 
of this, or 620.5081, is one of the 34 items making up the 
figure 6,464.0275, the entry for perishable goods in the 
last column of the preceding table. The sum of the entries 
in this last column, 15,564.9040, represents variation in 
price changes within the three classes. It is not influenced 
by factors of perishability or durability, since the total is 
affected only by variation among perishable goods, variation 
among semi-durable goods, and variation among durable 
goods. 

Eighty-five items enter into this total. However, only 
82 degrees of freedom are present. The 34 perishable 
goods possess 33 degrees of freedom to vary, the 26 semi- 
durable goods possess 25 degrees of freedom, and the 25 
durable goods possess 24 degrees of freedom. For the 
standard deviation defining variability within classes we 
have, therefore 

$$\sigma = \sqrt{\frac{15564.9040}{82}} = 13.78.$$

This figure provides us with a yardstick, a measure of the 
degree of variation that is independent of the principle 
of classification employed in distinguishing perishable, 
semi-durable, and durable goods. This measures the variation 
due to the mass of floating causes known as chance. 
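
The pooling of the three within-class sums of squares may be sketched in Python as follows. The variable names are hypothetical; the figures are those of Table 114.

```python
import math

class_sizes = [34, 26, 25]                               # perishable, semi-durable, durable
within_sums_of_squares = [6464.0275, 3375.1849, 5725.6916]

# Each class loses one degree of freedom to its own mean.
degrees_of_freedom = sum(n - 1 for n in class_sizes)
total_within = sum(within_sums_of_squares)

within_variance = total_within / degrees_of_freedom
within_sd = math.sqrt(within_variance)
print(degrees_of_freedom, round(within_variance, 4), round(within_sd, 2))
```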
With this standard we may compare the differences 
between the three class averages presented in Table 113. 
The magnitude of these differences may be defined by 
a single measurement, a standard deviation. In its computa- 
tion the deviation of each class mean from the grand mean 
is measured, and the square of this deviation is multiplied 
by the number of items in the class in question. The proce- 
dure is illustrated in Table 115. 


Table 115 

Illustrating the Measurement of Variation between Classes 

    (1)              (2)          (3)             (4)                (5)               (6)
                                Mean of       Deviation of       Square of
Class of           No. of       price         class mean         deviation of        Weighted
industry         industries     relatives     from mean of       class mean from     squared
                                (1929         all                mean of all         deviation
                                = 100)        observations       observations        (2) × (5)

Producing
  perishable
  goods              34         69.81           - 1.65              2.7225             92.5650
Producing
  semi-durable
  goods              26         66.41           - 5.05             25.5025            663.0650
Producing
  durable
  goods              25         78.96           + 7.50             56.2500          1,406.2500

                                                                                    2,161.8800


The sum of the entries in column (6), 2,161.8800, measures 
the total variation between classes. Although weights are 
used in getting this total, the differences relate to three 
separate averages, only, and but two degrees of freedom 
are represented in the total. As a measure of the degree 
of variation between the three broad categories we have 
set up, we have 

$$\sigma = \sqrt{\frac{2161.8800}{2}} = 32.88.$$

This we may take as a measure of the difference in degree 
of change in selling price from 1929 to 1933 that appears 
to be related to the relative durability of products. 
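
The between-class measure can be reproduced from the class means in Table 113. The following Python sketch uses hypothetical variable names; the deviations are rounded to two places, as in Table 115, before squaring.

```python
import math

class_sizes = [34, 26, 25]            # perishable, semi-durable, durable
class_means = [69.81, 66.41, 78.96]   # 1933 price relatives, 1929 = 100
grand_mean = 71.46                    # mean of all 85 industries

# Square each class deviation and weight it by the number of items in the class.
weighted_squares = [
    n * round(mean - grand_mean, 2) ** 2
    for n, mean in zip(class_sizes, class_means)
]
total_between = sum(weighted_squares)

degrees_of_freedom = len(class_means) - 1   # three averages, two degrees of freedom
between_variance = total_between / degrees_of_freedom
between_sd = math.sqrt(between_variance)
print(round(total_between, 4), round(between_variance, 4), round(between_sd, 2))
```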

The next step involves the formal testing of this figure 
against the standard provided by the measures of variation 
within classes. This test is applied in Table 116. Certain 
necessary calculations are also indicated. 


Table 116 

Comparison of Measures of Variation 

               Degrees                   Mean
Nature of      of         Sum of         square        Standard
variability    freedom    squares        deviation     deviation   $\log_{10}\sigma$   $\log_e\sigma$
               n                         $\sigma^2$    $\sigma$

Between
  classes         2        2,161.8800    1,080.9400     32.88       1.51693     3.49288
Within
  classes        82       15,564.9040      189.8159     13.78       1.13925     2.62324
Total                     17,726.7840*

                                                    Difference = z =            0.86964


* The figure 17,726.7840, which is the sum of the squared deviations of 
individual observations from their respective class means and of the squared 
deviations of the several class means from the mean of all the observations, is 
equal to the sum of the squared deviations of all the individual observations 
from the mean of all the observations. In the table the total has been broken 
into two components, representing variability between classes and variability 
within classes. 

The test reduces, it is clear, to a comparison of two 
measures of variability. One, the standard of comparison, 
is the measure of variance within classes, a measure 
completely independent of the perishability or durability of 
the product. The other is a measure of variation between 
classes. Such variation might be due to the same general 
mass of causes responsible for variation within classes, or 
it might be due to special forces related to differences in 
the durability of the goods in question. If the former 
explanation is correct, the two measures of variation should 
be of the same order of magnitude, with due allowance 
for sampling fluctuations. If the second is the correct 
explanation the two measures of variation may differ 
appreciably in magnitude. The test, therefore, reduces to the 
question: Is the variation between classes significantly 
different from the variation within classes, account being 
taken of the degrees of freedom present in the two cases? 

This question could be answered with reference to the 
standard error of z, provided the distribution of z be normal, 
or approximately normal. This is the case when the n’s 
that measure the number of degrees of freedom are both 
large or when, though of moderate value, they are equal 
or nearly equal. This condition prevailed in the examples 
cited earlier. It is not met in the present instance, so we 
may not with accuracy employ the method of estimating 
and utilizing the standard error of z that was used in the 
earlier case. When the numbers of degrees of freedom are 
unequal and relatively small, as in this case, tests of 
significance may be most readily made with reference to a 
tabulation of values of z, prepared by R. A. Fisher. This tabulation 
gives, for various values of $n_1$ and $n_2$, values of z that would 
be exceeded 5 times out of 100, as a result of chance, if 
the true value of z were zero; it also gives one per cent 
values of z, i.e., values of z that would be exceeded 1 time 
out of 100, under conditions of random sampling, if the 
true value of z were zero. These two sets of values are 
reproduced in Appendix Tables VI and VII of this book, 
through the courtesy of Dr. Fisher and Oliver and Boyd, of 
Edinburgh, his publishers.* 

* Uses of the function z are discussed in R. A. Fisher’s book, Statistical 
Methods for Research Workers, Edinburgh, Oliver and Boyd, sixth ed., 1936. 

A table similar to Fisher’s z-table, but relating to an alternative measure F, 
has been constructed by George W. Snedecor. F is derived directly from the 
variances (i.e., the values of $\sigma^2$) that are being compared; it is the ratio of the 
larger of the two variances to the smaller. For a table of values of F and a 
discussion of its uses see George W. Snedecor, Statistical Methods, Ames, 
Iowa, Collegiate Press, 1937. 

In the present example the value of z, defining the degree 
of difference between the two measures of variation in 
Table 116, is .8696. In entering the z-table for the purpose 
of testing this measurement, the number of degrees of 
freedom corresponding to the larger of the two measures 
of variation compared is taken as $n_1$; $n_2$ is the number 
of degrees of freedom corresponding to the smaller of the 
two measures. This is a necessary procedure, with reference 
to the table as constructed. In the problem that now 
concerns us $n_1 = 2$, $n_2 = 82$. 
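
The relation between z and the variance ratio F may be illustrated in Python. The sketch assumes only the identity z = ½ log_e F, which follows from the definition of z as the difference of the natural logarithms of two standard deviations; the variable names are hypothetical, and the mean squares are those of Table 116.

```python
import math

between_mean_square = 2161.8800 / 2     # larger variance, n1 = 2
within_mean_square = 15564.9040 / 82    # smaller variance, n2 = 82

# F is the ratio of the larger variance to the smaller.
f_ratio = between_mean_square / within_mean_square

# z is half the natural logarithm of F.
z = 0.5 * math.log(f_ratio)
print(round(f_ratio, 2), round(z, 4))
```

Computed without intermediate rounding, z agrees with the tabulated .8696 to about three decimal places.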

For $n_1 = 2$ and $n_2 = 60$, the 1 per cent value of z is 
.8025; for $n_1 = 2$ and $n_2 = \infty$, the 1 per cent value of z 
is .7636. Interpolating, we obtain .7920 as the 1 per cent 
value of z for $n_1 = 2$, $n_2 = 82$.† If the true value of z were 
zero, we should expect a value as great as .7920, or greater, 
to occur as a result of chance only 1 time out of 100. The 
present value of z materially exceeds .7920; the probability 
of a value as great as this occurring as a result of chance, 
if the true value of z were zero, is less than 1 out of 100. 
The results of the test are not, therefore, consistent with 
the hypothesis that the true value of z is zero. The 
differences between the three class averages shown in Table 113 
are too great to be attributed to chance. We may conclude 
that the price movements of perishable, semi-durable, and 
durable manufactured goods between 1929 and 1933 were 
significantly different. 

† Interpolation in the z-table is based upon direct proportions among the 
reciprocals of the n’s. In the above case 

for $n_1 = 2$, $n_2 = 60$: the 1 per cent value of z = .8025;  $1/n_2 = 1/60 = .0167$ 
for $n_1 = 2$, $n_2 = \infty$: the 1 per cent value of z = .7636;  $1/n_2 = 1/\infty = .0000$ 

                                              $\Delta = .0389$          $\Delta = .0167$ 

We must find the 1 per cent value of z corresponding to $n_1 = 2$, $n_2 = 82$. 
For $1/n_2$ we have 

1/82 = .0122. 

The difference between 1/82 and $1/\infty$, for which we must interpolate between 
the given values of z, is .0122 - .0000 = .0122. The required 1 per cent value 

$$z = .7636 + \left(\frac{.0122}{.0167} \times .0389\right) = .7920.$$

The process of interpolation on the $n_1$ scale, if required, would be similar. 
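
The interpolation described in this note can be sketched in Python. The function name and argument order are hypothetical; an infinite number of degrees of freedom is represented by `math.inf`, whose reciprocal is zero. Because the note works with reciprocals rounded to four places, the unrounded result agrees with .7920 only to about four decimals.

```python
import math

def interpolate_on_reciprocals(n_target, n_a, z_a, n_b, z_b):
    """Interpolate a tabulated z in direct proportion to the reciprocals of n."""
    r_target, r_a, r_b = 1.0 / n_target, 1.0 / n_a, 1.0 / n_b
    fraction = (r_target - r_b) / (r_a - r_b)
    return z_b + fraction * (z_a - z_b)

# One per cent values of z for n1 = 2: .8025 at n2 = 60, .7636 at n2 = infinity.
z_one_per_cent = interpolate_on_reciprocals(82, 60, 0.8025, math.inf, 0.7636)
print(round(z_one_per_cent, 3))
```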
Variance Analysis in the Measurement of Relationship 

The procedure employed in the comparison of measures 
of variability is applicable to the measurement of correlation. 
Indeed, using this technique it is possible to employ a 
systematic procedure that is of great value in revealing 
the character and degree of the relationships prevailing 
between variable quantities. This procedure is illustrated 
in the next section. 

The method employed in applying to a typical correlation 
problem the method of analysis based on comparison of 
variances may be illustrated with reference to the data of 
alfalfa yield previously studied. These are presented in 
Table 117. 


Table 117 

Summary of Results Secured in Experiments with Alfalfa 

(The measurements in the body of the table measure yields, in tons per acre, 
in 44 experiments) 

                         Inches of irrigation water applied

              0      12      18      24      30      36      48      60     All

           2.35    4.31    5.69    6.00    7.53    7.58    8.05    5.55
           2.75    4.78    6.46    6.89    7.97    8.22    8.45    7.25
           2.89    4.84    7.02    7.96    8.32    8.63    8.63   10.17
           3.85    5.83    8.02    8.32    9.43    9.33    8.81   10.70
           5.52    6.51            8.38    9.54    9.38    9.52
           5.94    7.52            9.96   11.06   12.48  [illegible]

Average    3.88    5.63    6.80    7.92    8.98    9.27    9.02    8.42    7.48

The average yield of alfalfa, in these 44 experiments, 
was 7.48 tons per acre. But there was rather wide variation 
among the results. The sum of the squares of the deviations 
of the 44 observations from the mean is 228.33. This 
sum sets our problem. We should like to find reasons 
for this variation. 
TESTING FOR THE EXISTENCE OF CORRELATION 

The observations are set up above in a form suited to 
the testing of one hypothesis concerning the factors affect- 
ing alfalfa yield. The data are arranged in eight arrays, 
classified according to the depth of irrigation water applied. 
This depth varied from 0 to 60 inches. Variations in yield 
appear to be associated with variations in amount of water 
applied. As a basis for our procedure we set up the hypothe- 
sis that there is no such association. To test this hypothesis, 
we may break the sum that measures the total variation 
of yields into two parts measuring, respectively, the variation 
within arrays and the variation between arrays. 

To determine the total variation within arrays, the 
deviation of each observation from the mean of the array in 
which it falls is measured. The sum of the squares of these 
deviations, for all the arrays, is the desired total. Thus, 
in the first array of Table 117, the mean is 3.88 tons. 
The deviation of the first observation, 2.35, from this 
figure, is — 1.53; its square is 2.3409. The deviation of 
the second observation, 2.75, is - 1.13; its square is 
1.2769. Determining in similar fashion the deviations of 
the four other observations in that array from the mean 
of the array, squaring these, and adding the six squared 
values, we have 11.5320 as the sum of the squares of the 
deviations in the first array. Performing similar calculations 
for the seven other arrays, and adding the eight sums thus 
secured, we have a figure of 76.39. This is the total varia- 
tion within arrays. For convenience we may refer to this 
as component A of the total variation. 

In determining the total variation between arrays, the 
deviations of the means of the various arrays from the 
mean of all the observations are measured and squared, 
and the weighted sum of these squares is secured. Weights 
are based upon the number of observations in the several 
arrays. Thus the mean of the first array, 3.88, deviates 
from the mean of all the observations, 7.48, by 3.60; 
the square of this is 12.9600. Multiplying by six (the 
number of observations in the first class), we have 77.7600. 
Securing similar weighted figures for the seven other arrays, 
and adding, we have 151.94 as the total variation between 
arrays. This we may call component B. 
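
The two calculations may be sketched in Python for the first array alone. The six yields are those read from the first column of Table 117, and the array mean is rounded to two places, as in the text, before deviations are taken. The variable names are hypothetical.

```python
# Yields, in tons per acre, on the six plots that received no irrigation water.
first_array = [2.35, 2.75, 2.89, 3.85, 5.52, 5.94]
grand_mean = 7.48   # mean of all 44 observations

array_mean = round(sum(first_array) / len(first_array), 2)

# Contribution to component A: squared deviations from the array's own mean.
within_sum = sum((y - array_mean) ** 2 for y in first_array)

# Contribution to component B: squared deviation of the array mean from the
# grand mean, weighted by the number of observations in the array.
between_term = len(first_array) * round(array_mean - grand_mean, 2) ** 2

print(array_mean, round(within_sum, 4), round(between_term, 4))
```

Repeating the two steps for the seven other arrays and adding the results gives the components A and B of the text.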

In breaking up the total variation into two components* 
we have distinguished variations in yield that are definitely 
not related to differences in depth of irrigation water applied, 
from variations in yield that may or may not be related 
to irrigation differences. Within the first array, including 
six experiments on plots to which no irrigation water was 
applied, yields varied from 2.35 tons to 5.94 tons per acre. 
The total variation within this array (the sum of the squares 
of the deviations from the mean of the array) amounted 
to 11.5320. Since the irrigation factor was constant, this 
sum measures variation which is completely independent 
of changes in irrigation. This is true also of the figure 
76.39, measuring total variation within all the eight arrays 
set up in Table 117. Differences in soils and innumerable 
minor factors combined to create variation within these 
arrays. The figure 76.39 measures the play of that host 
of undefined forces to which we give the name chance. 
The one specific factor which does not affect this figure is 
irrigation. We have measured the variation in such a 
way that irrigational differences do not enter. 

* The sum of the two components is, of course, equal to the total. 

     Variation within arrays (Component A)        76.39 
     Variation between arrays (Component B)      151.94 
     Total variation                             228.33 

For a demonstration of this relationship see note, pp. 418-9. 

To ensure full consistency between components A and B and the total 
(and among the sub-divisions of B later defined), when these quantities are 
independently computed, it is necessary that all computations be carried to 
more decimal places than are customarily retained. 

Irrigational differences do enter definitely into the 
variation between arrays. Indeed, they may be the dominant 
factor in this variation, which is measured by the figure 
151.94. But of this we cannot be sure. For the means of 
the eight arrays differ among themselves not only because 
of differences in the amounts of irrigation water applied to 
the different plots. To yield differences due to the irrigation 
factor are added yield differences due to the innumerable 
other forces that influence alfalfa yield, the forces we lump 
together as chance. For chance factors affect the means 
of the various arrays, and so affect the variation between 
arrays, just as they affect the variation within arrays. As 
the experiment was designed, the influence of irrigational 
differences is present only in the variation between arrays, 
but the influence of “ chance ” is present in both the variation 
within arrays and the variation between arrays. 

In this fact is found the key to our problem, and the 
instrument for testing our hypothesis. For, in so far as 
chance alone is operative, the variation between arrays 
would be expected to be of the same order of magnitude 
as the variation within arrays. The figures we have so 
far examined indicate that the variation between arrays 
is greater than the variation within arrays. But this may
be a purely fortuitous result. The apparent increase of 
yield with increased irrigation may be entirely a chance 
phenomenon, similar to a run of heads in tossing a coin. 
This we must test. We must determine whether the forces 
responsible for variation between arrays are the same as 
the forces responsible for variation within arrays. 

The hypothesis we shall test, and which may of course be 
disproved, is that the forces responsible for variation between 
arrays are the same as the forces responsible for variation 
within arrays; in other words, that there is no association 
between depth of irrigation water applied and alfalfa yield. 
The nature of the test to be applied has been indicated 
in the preceding sections. We shall compare two measures 
of variation, to determine whether they are of the same 
order of magnitude. But before this test is applied, account
must be taken of the number of degrees of freedom prevailing
in each case. This concept calls for brief explanation.

If the data of alfalfa yield related to but one plot of 
land, in one year, there would be no variation. A single 
observation would coincide with the mean, and the standard 
deviation would be zero. With a second observation oppor- 
tunity for variation arises. But we may think of it as a 
single opportunity. With but two observations there is 
but one degree of freedom to vary. With three observations, 
two opportunities to vary are given; there are two degrees 
of freedom. In problems of this sort the number of degrees 
of freedom is equal to N − 1. Our present example includes
44 observations; hence the total variation 228.33 represents
the resultant of 43 degrees of freedom.

How are these 43 degrees divided between the two 
components, A and B? As regards variation within arrays,
this may be readily determined by reference to Table 117.
Variation within arrays, it will be recalled, was measured 
with reference to the means of the various arrays. In the 
first array, containing six observations, there exist five 
degrees of freedom to vary from the mean of that array. 
The same is true of the arrays relating to 12, 24, 30, 36, and 
48 inches of irrigation water. In each of the arrays relating 
to 18 and 60 inches of water there are but four observations, 
with three degrees of freedom. The total of these degrees 
of freedom is 36. Variation between arrays was determined 
by measuring the deviations of the means of eight arrays 
from the general mean of the distribution. Since eight 
different values are involved, there are seven degrees of 
freedom. (The fact that weights were employed in securing 
the total variation between arrays does not affect the deter- 
mination of degrees of freedom.) The 36 and the 7, combined, 
use up all the 43 degrees of freedom entering into the total 
variation. 
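
The allocation of degrees of freedom just described can be checked by direct count. The following is a minimal sketch, assuming only the class frequencies shown in Table 119; the names are illustrative.

```python
class_sizes = [6, 6, 4, 6, 6, 6, 6, 4]  # observations per array, Table 119

n_total = sum(class_sizes)                   # 44 observations in all
df_total = n_total - 1                       # N - 1
df_within = sum(f - 1 for f in class_sizes)  # each array loses one degree
df_between = len(class_sizes) - 1            # eight means about one general mean

# The two components use up all the degrees of freedom in the total.
assert df_within + df_between == df_total
print(df_total, df_within, df_between)  # prints: 43 36 7
```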

Knowing these degrees of freedom we may now reduce 
the measures of variation within arrays and of variation 
between arrays to comparable terms, and determine the
significance of the difference between them. This is done
in Table 118. This table and others following differ somewhat
from those employed in similar comparisons in the opening
sections of this chapter. In the earlier tables variability
was measured in units of the standard deviation, and the
function z was derived from the relationship

$$z = \log_e \sigma_1 - \log_e \sigma_2.$$

It is often more convenient to perform the necessary calculations
in terms of the variance, that is, of σ², and to derive
z from the relationship

$$z = \frac{\log_e \sigma_1^2 - \log_e \sigma_2^2}{2}.$$


The procedures lead to the same result, of course, since 
half the difference between the logarithms of the squared 
standard deviations is equal to the difference between the 
logarithms of the standard deviations, but the use of squared 
measurements eliminates one step in the calculation. 


Table 118

A Test of the Existence of Correlation

                                 No. of                     Mean         Natural logarithm
Nature of variability            degrees of    Sum of       square       of mean square
                                 freedom       squares      (variance)

Within arrays (Component A)         36          76.39         2.12            0.7514
Between arrays (Component B)         7         151.94        21.71            3.0778

                                                               Difference =   2.3264
                                                                        z =   1.1632

When we divide the sums of the squares by the corre- 
sponding figures defining degrees of freedom, we have com- 
parable measures of variance. Now it appears that the 
variance between arrays (21.71) is distinctly greater than 
the variance within arrays (2.12), in disproof of the hypothesis
that the same forces account for the two variances.
But we have a precise test to employ in determining
whether these two variances are of the same degree of
magnitude, within sampling limits. This is the coefficient
z, which is half the difference between the natural logarithms
of the two variances. In the present case, z is equal to

$$\frac{3.0778 - .7514}{2}, \text{ or } 1.1632.$$


If the forces responsible for variation within arrays were 
the same as those responsible for variation between arrays 
(that is, if our hypothesis were true), the value of z would
be zero, with a sample of infinite size. The value of z we
have secured is not zero. This may be proof that our
hypothesis is false, or it may merely be a result of sampling
fluctuations. The value of z might be zero in a given infinite
population, but a random sample would be expected to 
yield results deviating considerably from zero. We wish 
now to take account of sampling fluctuations, in determining 
whether the result we have secured is consistent with the 
hypothesis that the true value of z is zero. 
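
The coefficient z may also be computed directly from the sums of squares and their degrees of freedom. The sketch below is offered as an illustration; `fisher_z` is a name of our own choosing, not a library routine. Since it works from unrounded mean squares, its result differs in the fourth decimal place from the 1.1632 obtained above from rounded logarithms.

```python
import math

def fisher_z(sum_sq_1, df_1, sum_sq_2, df_2):
    """Half the difference of the natural logarithms of two mean squares."""
    mean_sq_1 = sum_sq_1 / df_1
    mean_sq_2 = sum_sq_2 / df_2
    return 0.5 * (math.log(mean_sq_1) - math.log(mean_sq_2))

# Between arrays (Component B) against within arrays (Component A).
z = fisher_z(151.94, 7, 76.39, 36)

# Since z is half the log of the ratio of mean squares, exp(2z) recovers
# that ratio (the quantity later usage calls the variance ratio).
variance_ratio = math.exp(2 * z)

print(round(z, 4), round(variance_ratio, 2))
```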

In determining the significance of the present results
we enter Appendix Table VI with n₁ (the number of degrees
of freedom corresponding to the larger variance) equal to
7 and n₂ equal to 36. Interpolating in Table VI, we find that
the 1 per cent value of z corresponding to the stated values
of n₁ and n₂ is .5780.¹ A value as great as this or greater
would occur only 1 time out of 100, as a result of sampling
fluctuations, if the true value of z were zero. The
actual value, 1.1632, far exceeds the 1 per cent value
of z. The evidence strongly indicates that z deviates from
zero not because of the play of chance, but because the
forces responsible for variation between arrays are of a
different order from those responsible for variations within
arrays. We are justified in concluding that our results
are not consistent with the assumption that the true value
of z is zero. The hypothesis that the forces responsible
for variation between arrays are of the same character
as those responsible for variation within arrays is not
tenable. The results indicate the presence of a real connection
between alfalfa yield and depth of irrigation water
applied.

¹ It is necessary to interpolate on both scales of the table. First, following
the procedure indicated on a preceding page, we interpolate in respect of n₂.
We obtain

for n₁ = 6, n₂ = 36, the 1 per cent value of z = .6047; 1/6 = .1667
for n₁ = 8, n₂ = 36, the 1 per cent value of z = .5580; 1/8 = .1250
                                   differences:  .0467         .0417

We must now interpolate on the n₁ scale, since the degrees of freedom are
n₁ = 7, n₂ = 36. For 1/n₁ we have 1/7 = .1429. The difference between 1/7
and 1/8, for which we must interpolate between the given values of z, is
.1429 − .1250, or .0179.

$$.5580 + \left(\frac{.0179}{.0417} \times .0467\right) = .5780.$$
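
The 1 per cent value .5780 is reached by interpolating linearly on the scale of the reciprocal of n₁, between the tabled entries for n₁ = 6 and n₁ = 8. A minimal sketch of that step follows; the function name is ours, and the two tabled values are those quoted in the text (already interpolated to n₂ = 36).

```python
def interpolate_in_reciprocal(n_target, n_low, z_low, n_high, z_high):
    """Linear interpolation of a tabled z value on the scale of 1/n."""
    # Fraction of the way from 1/n_high toward 1/n_low.
    step = (1 / n_target - 1 / n_high) / (1 / n_low - 1 / n_high)
    return z_high + step * (z_low - z_high)

# 1 per cent values for n1 = 6 and n1 = 8, with n2 = 36.
z_one_percent = interpolate_in_reciprocal(7, 6, 0.6047, 8, 0.5580)
print(round(z_one_percent, 4))
```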

TESTING THE HYPOTHESIS OF A LINEAR RELATIONSHIP 

Since it appears that there is a relationship between
these two variables, it is now in order to secure an acceptable
function, defining the relationship in quantitative terms.
We may do this by testing, in turn, various hypotheses
concerning the form of this function, until we secure one
with which the observations are not inconsistent. We shall
start with the hypothesis that there is a linear relationship
between alfalfa yield and depth of irrigation water applied.¹

The first step in applying the present test is to fit a
straight line to the means of the eight arrays shown in
Table 117. Variation among these means (component B
of the total variation) reflects the presence of correlation
between alfalfa yield and irrigation water applied. If the
correlation is perfectly linear, all these class means will
fall on the straight line; all the variation between arrays
will be accounted for by the hypothesis of a linear relationship.
If the relationship is substantially, though not perfectly,
linear, the portion of component B not accounted
for by linear regression will be insignificant. If the regression
is not truly linear the residue of B not accounted for (i.e.,
the scatter of the means of the arrays about the straight
line of regression) will be too great, and some other hypothesis
concerning the character of the relationship between
alfalfa yield and irrigation water applied must be employed.

¹ Each hypothesis tested should be rational, acceptable on logical grounds.
If we are thinking of general relationships, prevailing over the entire range of
possible observation, the assumption of a straight-line relationship between
alfalfa yield and amount of irrigation water applied is not tenable. For it is
not to be expected that increased irrigation will increase yield without limit.
In the present case we test the hypothesis of a linear relationship in order that
the demonstration of procedure may be systematic and complete, although that
hypothesis is not a rational one, even within the range of the present observations.

A straight line fitted by the method of least squares 
to the means of the eight arrays is shown in Fig. 82 on
page 406. The equation to the line is Y = 5.038 + .0886X,
where Y is alfalfa yield in tons per acre and X is depth of
irrigation water applied, in inches. [We should note that
in the fitting process the mean of each array is weighted
by the number of observations in that array. This means,
merely, that six points are assumed to have coordinates
of 0, 3.88 (equal to those of the mean of the first array), 
that four points are assumed to have coordinates of 18, 6.80 
(equal to those of the mean of the third array), etc.] In 
Table 119 on page 510 are given the values of the means of 
the various arrays, and the corresponding computed values, 
as derived from the straight line of regression. 
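
The weighted fitting process may be sketched as follows. This is an illustration under stated assumptions, not a transcript of the original computation: the class means and frequencies are those printed in Table 119, the names are our own, and the constants it returns should agree with 5.038 and .0886 only to the accuracy that the rounded class means allow.

```python
depths = [0, 12, 18, 24, 30, 36, 48, 60]                         # inches (X)
class_means = [3.88, 5.63, 6.80, 7.92, 8.98, 9.27, 9.02, 8.42]   # tons (Y)
class_sizes = [6, 6, 4, 6, 6, 6, 6, 4]                           # weights f

n = sum(class_sizes)
mean_x = sum(f * x for f, x in zip(class_sizes, depths)) / n
mean_y = sum(f * y for f, y in zip(class_sizes, class_means)) / n

# Weighted sums of squares and products about the two means.
s_xx = sum(f * (x - mean_x) ** 2 for f, x in zip(class_sizes, depths))
s_xy = sum(
    f * (x - mean_x) * (y - mean_y)
    for f, x, y in zip(class_sizes, depths, class_means)
)

slope = s_xy / s_xx                 # least-squares slope b
intercept = mean_y - slope * mean_x  # least-squares constant a
print(round(intercept, 3), round(slope, 4))
```

Weighting each mean by its frequency is equivalent to fitting the line to 44 points, each array mean being repeated as many times as there are observations in the array.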

It is clear from the graph and the table that the fit of 
the straight line to the means of the arrays is not perfect. 
The inadequacy of the fit is measured by the sum of the 
squared deviations of the class means from the corresponding 
computed values (each squared deviation being weighted 
by the number of observations in the given class). This 
sum is equal to 44.79.

This sum, to which we may refer as B₂, is one component
of B, the variation between arrays. It is that portion of
the variation between arrays that is not accounted for
by the hypothesis of a linear relation between yield and
irrigation water. The method of deriving the other component
of B is shown in Table 120.

Table 119

Alfalfa Yield and Depth of Irrigation Water
(Class means and values based on linear relationship Y = 5.038 + .0886X)

  (1)       (2)       (3)        (4)              (5)              (6)        (7)
Inches     No. of    Mean     Estimated       Difference
of water   obser-    yield    yield, linear   between mean
(class)    vations   of       relationship    yield of class
                     class    (tons)          and estimated
                                              yield
                                              (y_p − y_c)
             f        y_p        y_c               d               d²         fd²

   0         6       3.88       5.04            −1.16            1.3456      8.0736
  12         6       5.63       6.10            − .47             .2209      1.3254
  18         4       6.80       6.63            + .17             .0289       .1156
  24         6       7.92       7.16            + .76             .5776      3.4656
  30         6       8.98       7.70            +1.28            1.6384      9.8304
  36         6       9.27       8.23            +1.04            1.0816      6.4896
  48         6       9.02       9.29            − .27             .0729       .4374
  60         4       8.42      10.36            −1.94            3.7636     15.0544

                                                                            44.7920

The sum 107.15, to which we may refer as B₁, is that
component of the variation between arrays which is
accounted for by the hypothesis of linear regression. The
items in col. (3) of Table 120 differ from 7.48, the
mean of all the observations, for the reason suggested
by the hypothesis. They differ, on our present assumption,
because with increased applications of water yield
increases in a manner defined precisely by the equation
Y = 5.038 + .0886X. The sum of these variations, 107.15,
represents, on this assumption, the full effect on alfalfa 
yield of variations of irrigation applications. 

Table 120

Computation of Variation in Alfalfa Yield Attributable to Irrigation
Differences on the Hypothesis of Linear Regression

  (1)       (2)        (3)             (4)             (5)              (6)        (7)
Inches     No. of    Estimated      Mean yield,     Difference
of         obser-    yield, linear  all obser-      between mean
water      vations   relationship   vations         yield and
                     (tons)                         yield estimated
                                                    on linear
                                                    hypothesis
                                                    (y_c − Ȳ)
             f          y_c             Ȳ               d               d²         fd²

   0         6         5.04           7.48            −2.44           5.9536     35.7216
  12         6         6.10           7.48            −1.38           1.9044     11.4264
  18         4         6.63           7.48            − .85            .7225      2.8900
  24         6         7.16           7.48            − .32            .1024       .6144
  30         6         7.70           7.48            + .22            .0484       .2904
  36         6         8.23           7.48            + .75            .5625      3.3750
  48         6         9.29           7.48            +1.81           3.2761     19.6566
  60         4        10.36           7.48            +2.88           8.2944     33.1776

                                                                                107.1520

The total of the two sums to which we have referred as
B₁ and B₂ is equal to 151.94, the total variation between
arrays. Working on the hypothesis that the variables
with which we are dealing stand in a linear relationship,
we have broken the component B of the total variation
into two portions. One of these (B₁) measures the variation
between arrays that is accounted for by the linear hypothesis;
the other (B₂) measures the variation between arrays that
is not accounted for by that hypothesis. We should expect
some departure from linearity in a sample such as ours,
even though it were drawn from a universe marked by a
perfect linear relationship. But there are limits to the
deviations that might reflect fluctuations of sampling. The
question we now face is whether B₂ is small enough to be
accepted as the resultant of random factors, or whether
it is so large as to represent a breakdown of our hypothesis.
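
The division of component B into B₁ and B₂ may be verified numerically. The sketch below is illustrative: it refits the weighted straight line to the class means of Table 119 and then accumulates the two weighted sums of squares. All names are our own, and the results match 107.15 and 44.79 only approximately, because the tables work from estimates rounded to two places.

```python
depths = [0, 12, 18, 24, 30, 36, 48, 60]
class_means = [3.88, 5.63, 6.80, 7.92, 8.98, 9.27, 9.02, 8.42]
class_sizes = [6, 6, 4, 6, 6, 6, 6, 4]

n = sum(class_sizes)
mean_x = sum(f * x for f, x in zip(class_sizes, depths)) / n
general_mean = sum(f * y for f, y in zip(class_sizes, class_means)) / n

# Weighted least-squares straight line through the class means.
s_xx = sum(f * (x - mean_x) ** 2 for f, x in zip(class_sizes, depths))
s_xy = sum(
    f * (x - mean_x) * (y - general_mean)
    for f, x, y in zip(class_sizes, depths, class_means)
)
slope = s_xy / s_xx
intercept = general_mean - slope * mean_x
fitted = [intercept + slope * x for x in depths]

# B: class means about the general mean.
b = sum(f * (y - general_mean) ** 2 for f, y in zip(class_sizes, class_means))
# B1: fitted values about the general mean (accounted for by the line).
b1 = sum(f * (yc - general_mean) ** 2 for f, yc in zip(class_sizes, fitted))
# B2: class means about the fitted values (not accounted for).
b2 = sum(f * (y - yc) ** 2 for f, y, yc in zip(class_sizes, class_means, fitted))

print(round(b1, 2), round(b2, 2), round(b, 2))
```

With a least-squares line the two portions add exactly to B, apart from floating-point error, which is why B₂ may be obtained either directly or by subtraction.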

In our earlier discussion we noted that component A 
of the total variation measured the influence of a host of 
random forces affecting alfalfa yield, forces other than the 
irrigation factor. Component A, therefore, serves as an 
index of the magnitude of random forces, and hence as a
standard defining the probable limits of sampling fluctua- 
tions, in so far as these are present in component B. We 
may use component A, which relates to variation within 
arrays, as a yardstick in determining whether B2 is attribut- 
able to fluctuations of sampling, or whether it is too large 
to be so explained. 

In comparing components A and B₂ account must be
taken of the number of degrees of freedom present in each.
This has already been established for A. The following
tabular summary of the operations just performed may help
to explain the relations involved for B₂.

                                              No. of degrees    Sum of
Nature of variability                         of freedom        squares

Between arrays, due to linear regression
  (Component B₁)                                    1           107.15
Deviations from straight line of regression
  (Component B₂)                                    6            44.79
Total variation between arrays
  (Component B)                                     7           151.94

The seven degrees of freedom entering into component B
are divided, one to component B₁ and six to component B₂.
That the points on a straight line vary from one another
with one degree of freedom is clear from a consideration
of a linear equation y = a + bx. That the values of y
may differ is due to the presence of the coefficient b, which
defines the slope. If b were zero, the equation would define
a horizontal line, with values of y constant. It is the slope
that constitutes the one degree of freedom among points
defined by a linear equation. With respect to B₂, we are
dealing with eight points, to which a straight line has
been fitted. If there were but two points both of them
would lie on the line; there would be no possibility of
deviation. With three points, one degree of freedom to
deviate is introduced; with eight points there are six degrees
of freedom. The degrees of freedom to deviate from any
fitted curve are obviously equal to the number of points
to which the curve is fitted, less the number of constants
in the equation to that curve.

Dividing 44.79 by 6 we may secure, then, the value of 
the variance (the mean square) comparable to the variance 
of component A. A test of our hypothesis again reduces to
a comparison of variances. This appears in Table 121. 

Table 121

A Test of the Hypothesis of Linear Relationship

                                      Degrees of    Mean square    Natural logarithm
Nature of variability                 freedom       (variance)     of mean square
                                          n             σ²             log_e σ²

Within arrays (Component A)              36            2.12              .7514
Deviation from straight line of
  regression (Component B₂)               6            7.47             2.0109

                                                         Difference =   1.2595
                                                                  z =    .6298

The variation within arrays reflects the play of random
factors, independent of irrigation. The force of these factors
is indicated by a variance of 2.12. If similar random factors,
independent of irrigation, were responsible for the deviations 
of the means of the eight arrays from the straight line of 
regression, we should expect the variance that measures 
such deviations to be of the same order of magnitude. 
Actually it is much greater, 7.47. But we cannot say, 
from inspection, that the difference between the two vari- 
ances is not due to fluctuations of sampling. An accurate 
test is needed. We may compute the coefficient z, half 
the difference between the natural logarithms of the two 
variances, and apply such a test. 

From the values given we secure a value of z equal to 
.6298. In determining whether this value is significantly 
different from zero, use must be made again of Fisher’s 
tables. For the values of n₁ and n₂ are relatively small
and unequal, and the distribution of z under these conditions
would not be sufficiently close to the normal type to justify
the use of its standard deviation. Entering Appendix
Table VI with n₁ equal to 6, n₂ to 36, we find that the 1 per
cent value of z is .6047. We take this to mean that, if the true
value of z were zero, random sampling fluctuations would 
be expected to give a value of z as great as .6047, or greater, 
only one time out of 100 trials. The actual value of z in 
the present instance is greater than .6047. Only rarely, 
less frequently than one time out of 100, would chance 
account for a value of z as great as the one observed. We 
conclude, therefore, that random forces, of the type respon- 
sible for variation within arrays, are not responsible for the 
deviations of the means of the eight arrays from the straight 
line of regression. These deviations are too great to be con- 
sistent with the hypothesis that there is a linear relationship 
between alfalfa yield and depth of irrigation water. This 
equation fails to account, adequately, for the observed 
variation between arrays. 
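
The test of the linear hypothesis reduces to a single comparison, which may be set out as a short sketch. The sums of squares and degrees of freedom are those given above; the 1 per cent value .6047 is taken from the text rather than computed, and the variable names are illustrative.

```python
import math

mean_sq_within = 76.39 / 36      # Component A, 36 degrees of freedom
mean_sq_deviation = 44.79 / 6    # Component B2, 6 degrees of freedom

# Half the natural logarithm of the ratio of the two mean squares.
z = 0.5 * math.log(mean_sq_deviation / mean_sq_within)

z_one_percent = 0.6047           # tabled value for n1 = 6, n2 = 36, per the text

# The hypothesis is rejected when z exceeds the 1 per cent value.
linear_hypothesis_rejected = z > z_one_percent
print(round(z, 4), linear_hypothesis_rejected)
```

The margin here is narrow, so the conclusion depends on the level of significance adopted as the standard.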

TESTING THE HYPOTHESIS OF A CURVILINEAR RELATIONSHIP 

We may now test the hypothesis that a power curve 
of the second degree (Y = a + bX + cX²) defines the relation
between alfalfa yield and depth of irrigation water
applied. The procedure is identical with that followed
in the case of the straight line. By the method of least
squares we determine the best values of the constants in
an equation of the desired form. The curve is fitted to
the means of the eight arrays, each weighted by the number
of observations in that array. The derived equation is
Y = 3.539 + .2527X − .002827X². The curve appears
graphically in Fig. 82, and the computation of the sum 
of the squared deviations from it is shown in Table 122. 
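
The fitting of the second degree curve may be sketched in the same way as that of the straight line, by forming and solving the weighted normal equations. The routine below is an illustration of that standard method, written by us for this purpose (`weighted_polyfit` is a hypothetical helper, not a library function); it uses the class means and frequencies printed in Table 119, and its constants agree with those of the derived equation only to the accuracy the rounded means allow.

```python
depths = [0, 12, 18, 24, 30, 36, 48, 60]
class_means = [3.88, 5.63, 6.80, 7.92, 8.98, 9.27, 9.02, 8.42]
class_sizes = [6, 6, 4, 6, 6, 6, 6, 4]

def weighted_polyfit(xs, ys, ws, degree):
    """Weighted least squares for a polynomial, via the normal equations."""
    size = degree + 1
    # Row i: sum(w * x**(i+j)) * coef_j = sum(w * x**i * y), for j = 0..degree.
    rows = [
        [sum(w * x ** (i + j) for x, w in zip(xs, ws)) for j in range(size)]
        + [sum(w * x ** i * y for x, y, w in zip(xs, ys, ws))]
        for i in range(size)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(size):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [p - factor * q for p, q in zip(rows[r], rows[col])]
    return [rows[i][size] / rows[i][i] for i in range(size)]

a, b, c = weighted_polyfit(depths, class_means, class_sizes, 2)
fitted = [a + b * x + c * x * x for x in depths]

# Weighted sum of squared deviations of the class means from the curve.
unexplained = sum(
    f * (y - yc) ** 2 for f, y, yc in zip(class_sizes, class_means, fitted)
)
print(round(a, 3), round(b, 4), round(c, 6), round(unexplained, 2))
```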

The inadequacy of the fit is measured this time by the
figure 4.61, the sum of the squared deviations from the
power curve of the second degree. This sum, to which we
may refer as B₄, is a component of B, the variation between
arrays. It is the portion that is not accounted for by the
hypothesis of a curvilinear relationship, of the type assumed,
between alfalfa yield and irrigation water applied. The
other component of B is derived by the method indicated
in Table 123 on page 516.

Table 122

Alfalfa Yield and Depth of Irrigation Water
(Class means and values based on a power curve of the second degree)

  (1)       (2)       (3)        (4)             (5)              (6)        (7)
Inches     No. of    Mean     Estimated      Difference
of water   obser-    yield    yield, from    between mean
(class)    vations   of       equation       yield of class
                     class    (tons)         and estimated
                     (tons)                  yield
                                             (y_p − y_c)
             f        y_p        y_c              d               d²         fd²

   0         6       3.88       3.54            + .34            .1156       .6936
  12         6       5.63       6.16            − .53            .2809      1.6854
  18         4       6.80       7.17            − .37            .1369       .5476
  24         6       7.92       7.98            − .06            .0036       .0216
  30         6       8.98       8.58            + .40            .1600       .9600
  36         6       9.27       8.97            + .30            .0900       .5400
  48         6       9.02       9.16            − .14            .0196       .1176
  60         4       8.42       8.52            − .10            .0100       .0400

                                                                            4.6058

We may designate by B₃ the sum 147.32. This is the
component of the variation between arrays that is accounted
for by the hypothesis of a relationship defined by a second
degree curve. The items in col. (3) of Table 123 differ
from the mean of all the observations, on our present
assumption, because alfalfa yield varies with increased
applications of water in a manner defined by the equation

Y = 3.539 + .2527X − .002827X².

Table 123

Computation of Variation in Alfalfa Yield Attributable to Irrigation
Differences on the Hypothesis of a Non-Linear Regression

  (1)       (2)        (3)             (4)             (5)            (6)         (7)
Inches     No. of    Estimated      Mean yield,
of         obser-    yield,         all obser-
water      vations   equation of    vations
                     second degree
             f          y_c             Ȳ           d = y_c − Ȳ        d²         fd²

   0         6         3.54           7.48            −3.94         15.5236     93.1416
  12         6         6.16           7.48            −1.32          1.7424     10.4544
  18         4         7.17           7.48            − .31           .0961       .3844
  24         6         7.98           7.48            + .50           .2500      1.5000
  30         6         8.58           7.48            +1.10          1.2100      7.2600
  36         6         8.97           7.48            +1.49          2.2201     13.3206
  48         6         9.16           7.48            +1.68          2.8224     16.9344
  60         4         8.52           7.48            +1.04          1.0816      4.3264

                                                                               147.3218

We have again broken B, the total variation between
arrays, into two components, B₃ representing the influence
of the irrigation factor, working in accordance with a definite
law, and B₄ representing random factors, or random factors
combined with the irrigation factor. (The irrigation factor
enters into B₄ to the extent that the hypothesis in question
fails to take account of the true relation between alfalfa
yield and depth of water applied.) This is, of course, a
different division of B from that resulting from the application
of a linear hypothesis. The present division may be
set down in summary.


                                            No. of degrees    Sum of     Mean
Nature of variability                       of freedom        squares    square

Between arrays, due to regression of
  second degree (Component B₃)                    2           147.32
Deviations from second degree curve
  of regression (Component B₄)                    5             4.61      .92
Total variation between arrays
  (Component B)                                   7           151.93

The seven degrees of freedom entering into component B
are now divided, five to component B₄ and two to component
B₃. The reasons for this allocation of the degrees of
freedom are similar to those presented in discussing the linear
hypothesis. As regards component B₄, the item now of
chief concern to us, it is clear that when a curve defined
by an equation with three constants is fitted to eight
points there are five degrees of freedom to deviate from
that curve.

Dividing 4.61 by 5 we secure .92, the value of the vari- 
ance comparable to the variance of component A. For 
again we must use a criterion based on A, in determining 
the limits within which variation due to random factors, 
independent of irrigation, may play. We come again to a 
comparison of variances. 


Table 124

A Test of the Hypothesis of Curvilinear Relationship

                                      Degrees of    Mean square    Natural logarithm
Nature of variability                 freedom       (variance)     of mean square
                                          n             σ²             log_e σ²

Within arrays (Component A)              36            2.12              .7514
Deviation from second degree curve
  of regression (Component B₄)            5             .92             −.0834

                                                         Difference =   −.8348
                                                                  z =   −.4174



In this case the degree of deviation from the curve of 
regression defined by the power curve of the second degree 
is actually less than the deviation within arrays, which 
serves as our yardstick. The value of z is therefore negative, 
equal to −.4174. This measure may be tested for significance
by the methods previously discussed. The z-table
is entered with n₁ = 36 (the number corresponding to the
larger of the two variances), n₂ = 5. Interpolating in the
table for these values we obtain 1.1158 as the 1 per cent
value of z. The present value is distinctly less than this.
The difference between the two measures of variance is
not significant. The departures from the curve of regression
may be attributed to “chance,” that is, to random factors 
independent of the irrigation factor. 

In following this general procedure it is necessary to test 
different hypotheses (i.e., different functions) only until the 
difference between the variance defined by component A 
and the variance defining departures from the curve of 
regression be small enough to be attributed to the play 
of chance. Thus, if a P of .05 constitutes our standard,
the difference between the two variances given in the preceding
table, as measured by z, might be positive and as
great as .4536, without leading to rejection of the hypothesis
being tested. It could be as great as .6370 if our standard
of significance were a P of .01.¹ A rather exceptionally
close fit by the second degree curve we have employed gives 
us the negative value of z we have actually obtained. 
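
The rule of procedure just stated (test successive functions only until the deviations from the fitted curve are no greater than chance would allow) may be sketched as a simple loop. This is an illustration under stated assumptions: the sums of squares, degrees of freedom, and 1 per cent values of z are those quoted in the text, and a positive z is compared with the limit appropriate to the case in which the deviations exceed component A.

```python
import math

mean_sq_within = 76.39 / 36  # Component A, the yardstick

# (hypothesis, sum of squared deviations from the fitted curve,
#  degrees of freedom, 1 per cent value of z quoted in the text)
hypotheses = [
    ("straight line", 44.79, 6, 0.6047),
    ("second degree curve", 4.61, 5, 0.6370),
]

accepted = None
results = []
for name, sum_sq, df, z_limit in hypotheses:
    z = 0.5 * math.log((sum_sq / df) / mean_sq_within)
    results.append((name, z))
    if z <= z_limit:   # deviations no greater than chance would allow
        accepted = name
        break          # the simplest acceptable function; stop here

for name, z in results:
    print(name, round(z, 4))
print("accepted:", accepted)
```

Ordering the candidate functions from the simplest upward is what makes the first acceptable function also the simplest acceptable one.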

We have arrived, then, at an hypothesis concerning the 
relation between alfalfa yield and depth of irrigation water 
applied, with which observed facts are not inconsistent. 
Our observations, be it noted, do not establish the truth 
of this hypothesis. Other hypotheses might be equally tenable,
and perhaps even more closely in accord with the facts.²
All that we can say is that the observed facts do not disprove
the hypothesis. If the hypothesis is tenable on rational
grounds, we have reached a conclusion upon which we may
rest, for the time.

¹ These figures are derived from Tables VI and VII by the process of interpolation
described above, with n₁ = 5 and n₂ = 36. (n₁ is taken as equal to 5, of
course, only when B₄ is greater than A; n₁ is always taken to represent the
number of degrees of freedom corresponding to the larger of the two variances
being compared.) This method of interpolation is applicable over the range of
the z-table, except for the corner relating to values of n₁ in excess of 24 and
values of n₂ in excess of 30. For dealing with cases in this region, R. A. Fisher
gives the following formulas for approximating the desired quantities:

$$\text{5 per cent value of } z = \frac{1.6449}{\sqrt{h-1}} - .7843\left(\frac{1}{n_1} - \frac{1}{n_2}\right)$$

$$\text{1 per cent value of } z = \frac{2.3263}{\sqrt{h-1}} - 1.235\left(\frac{1}{n_1} - \frac{1}{n_2}\right)$$

In these formulas h is the harmonic mean of n₁ and n₂. That is,

$$\frac{2}{h} = \frac{1}{n_1} + \frac{1}{n_2}.$$

² We could, of course, fit a curve of still higher degree, the equation to which
contained four constants, or more, instead of the three constants in the equation
actually employed. The deviations from this curve of higher degree would
be smaller than from the curve of second degree, and z would be correspondingly
smaller. It is a principle of scientific procedure, however, to employ the simplest
acceptable function. Needless complexities, whether in the form of unnecessary
assumptions or of unnecessary constants in an equation of relationship,
are rigorously avoided.


SUMMARY: VARIANCE ANALYSIS IN THE MEASURE OF RELATIONSHIP

The procedure employed in the last example may be
summarized and certain measurements presented which
show the relation of this procedure to methods discussed in
preceding chapters. The quantitative results are presented
in Table 125.

Table 125

Component Elements of the Variability of Alfalfa Yields and Various
Measures of Correlation

Total variability of observations                      Test of           Measure of
relating to alfalfa yield, and                         significance      correlation
components of this total

Total variability (sum of squared
deviations from mean)                       228.33

1. Division of total variability into:
   A.  Variation unrelated to irrigation
       factor (i.e., variation within
       arrays)                               76.39
   B.  Variation attributable to                       z = 1.1632        Correlation ratio
       irrigation factor, and to other                 1 per cent        η² = 151.94/228.33
       causes, in indeterminate                        value of z           = .6654
       proportions (i.e., variation                      = .5780         η  = .82
       between arrays)                      151.94

2. Division of component B of (1)
   above into:
   B₁. Variation attributable to
       irrigation factor on the
       assumption of a linear
       relationship                         107.15
   B₂. Variation attributable to                       z = .6298         Coefficient of
       irrigation factor, but not                      1 per cent        correlation
       explainable in terms of a linear                value of z        r² = 107.15/228.33
       relationship, and to other                        = .6047            = .4693
       causes, in indeterminate                                          r  = .69
       proportions                           44.79

3. Division of component B of (1)
   above into:
   B₃. Variation attributable to
       irrigation factor on the
       assumption of a relationship
       defined by power curve of
       second degree                        147.32
   B₄. Variation attributable to                       z = −.4174        Index of
       irrigation factor, but not                      1 per cent        correlation
       explainable in terms of power                   value of z        ρ² = 147.32/228.33
       curve of second degree, and to                    = 1.1158           = .6452
       other causes, in indeterminate                                    ρ  = .80
       proportions                            4.61


The meaning of this summary should be clear, with 
reference to the preceding demonstration. Component A 
of the total variability, being independent of the influence 
of the irrigation factor, is the yardstick, or standard of
reference, which is used in all the tests of significance noted
in the second column. Component B, in the first test,
is shown to be clearly greater than A, when account is
taken of the number of degrees of freedom present in the 
two quantities. Thereafter, component B is broken into 
sub-components, first on the hypothesis that alfalfa yield 
and irrigation are related by a linear function, next on 
the hypothesis that the relationship is defined by a power 
curve of the second degree. The evidence is not consistent 
with the first of these hypotheses, and it is rejected. (The 
hypothesis would be rejected on rational grounds, as well 
as on the basis of empirical evidence.) The results are not 
inconsistent with the second hypothesis, and we may accept 
it, subject to the possibility of modification on the basis 
of later experience. 

Three abstract measures of degree of correlation between 
alfalfa yield and applications of irrigation water are given 
in the right-hand column. All of these may be derived 
directly from the quantities employed in the variance 
analysis. Study of the elements of these correlation
measures, and of the relation of the several measures to the
corresponding hypothesis, will provide a suggestive review 
of the general problem of correlation. 
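
The three measures in the right-hand column of Table 125 all have the same form: a component of the sum of squares divided by the total, with the square root taken. A minimal sketch of that arithmetic follows; the function and variable names are illustrative and do not come from the text.

```python
import math

# Sums of squared deviations from Table 125 (alfalfa yields).
TOTAL = 228.33            # total variability
BETWEEN_ARRAYS = 151.94   # component B
LINEAR = 107.15           # component B1 (linear hypothesis)
SECOND_DEGREE = 147.32    # component B3 (power curve of second degree)


def correlation_measure(explained, total):
    """Return (squared measure, measure) for an explained sum of squares."""
    squared = explained / total
    return squared, math.sqrt(squared)


eta_sq, eta = correlation_measure(BETWEEN_ARRAYS, TOTAL)   # correlation ratio
r_sq, r = correlation_measure(LINEAR, TOTAL)               # coefficient of correlation
rho_sq, rho = correlation_measure(SECOND_DEGREE, TOTAL)    # index of correlation

for name, sq, root in [("eta", eta_sq, eta), ("r", r_sq, r), ("rho", rho_sq, rho)]:
    print(f"{name}^2 = {sq:.4f}, {name} = {root:.2f}")
```

Because the linear component can never exceed the variation between arrays, the coefficient of correlation computed this way cannot exceed the correlation ratio.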

We should note here that an assumption of normality 
is implied in the comparison of standard deviations, or 
of variances, in this type of analysis. Minor departures 
from normality do not materially affect the procedure, but
substantial departures do so. The conversion to other
forms (such as logarithms or reciprocals) of observations
not normally distributed in natural terms will sometimes 
yield normal distributions. Where this is possible, the 
precision of the method of variance analysis is increased 
by such conversion. Limitations arising out of material 
departures from normality may be avoided, also, by the 
use of ranks, as is done in the computation of the coefficient 
of rank correlation. Appropriate procedures have been 
developed by Milton Friedman.¹

¹ “The Use of Ranks to Avoid the Assumption of Normality Implicit in the



Variance Analysis in Testing the Significance of
Seasonal Fluctuations

The methods outlined in this chapter are applicable to 
certain of the problems encountered in the analysis of time 
series. They are peculiarly appropriate in determining 
whether the seasonal fluctuations observed in a given series 
represent a true seasonal pattern. Apparent seasonal move- 
ments would be present in any series of observations covering 
a period of years, by months. Chance factors would create 
some differences between averages of all the Januaries, all
the Februaries, etc., even though no true seasonal movement 
existed. We require an objective test, to be used in deter- 
mining whether the differences among such monthly aver- 
ages are significant or not. 

The entries in the body of Table 126 are the figures 
obtained when freight car loadings by months, for the 
period 1918-1927, are expressed as percentages of linear 
trend values. (The original data are given in Chapter VIII.) 
The arithmetic mean of the ten items for January appears 
in the bottom row, with similar means for the other months.¹
The test for seasonality involves answering the question:
Do these means differ significantly from 99.9867, the average
value of the 120 items in the table? In seeking to answer
this question we must break the total variance of the freight 
car loadings data into its elements. We wish to define that 
portion of the total variance apparently due to seasonal 
movements. This may then be appraised with reference to a 
yardstick representing what we may call the residual vari- 
ability of the series. 

In computing the total variance we may make use of the
familiar relation $\Sigma d^2 / N = \Sigma (d')^2 / N - c^2$, where $d$ is the deviation
of an observation from the true mean, $d'$ is the deviation
from an assumed mean, and $c$ is the difference between true
and assumed means. In this case we take the assumed
mean at 0 on the original scale, and $c$ is thus equal to the
mean. Since we wish to work with sums of squared values,
we use the relationship

$$\Sigma d^2 = \Sigma (d')^2 - Nc^2.$$

(The mean should be computed to more places than are to
be finally retained, since the process of squaring and
multiplying by N greatly magnifies even slight errors.)

Analysis of Variance,” Journal of the American Statistical Association, Vol. 32,
Dec. 1937, 675-701.

¹ These means, it may be noted, are not precisely the same as the seasonal
indices given in Chapter VIII. In seeking to improve the representativeness
of the monthly indices, only the four central items for each month were used in
the averaging process employed in that chapter. Here it is necessary to employ
the arithmetic mean of all the items for each month.

[Table 126: Freight Car Loadings in the United States, 1918-1927]

The entries in col. (16) of Table 126 are the sums of 
the squares of the items in the body of the table. Inserting 
the proper values in the above formula, we have 

$$\Sigma d^2 = 1,213,250.14 - 120 \times 99.9867^2$$
$$= 1,213,250.14 - 1,199,680.82$$
$$= 13,569.32.$$
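
The short-cut relation can be checked mechanically. The sketch below applies it to the figures quoted above and then confirms, on a few made-up observations, that it agrees with the direct sum of squared deviations. The function name and the sample values are illustrative assumptions.

```python
def sum_sq_about_mean(sum_of_squares, n, mean):
    """Sum of squared deviations from the mean, taking the assumed mean at 0.

    Implements  sum(d^2) = sum(d'^2) - N * c^2  with c equal to the mean.
    """
    return sum_of_squares - n * mean ** 2


# Figures from the text: sum of squared items, number of items, grand mean.
total_ss = sum_sq_about_mean(1_213_250.14, 120, 99.9867)
print(round(total_ss, 2))

# The same identity on a small made-up sample.
sample = [90.0, 100.0, 110.0, 104.0]
mean = sum(sample) / len(sample)
direct = sum((x - mean) ** 2 for x in sample)
shortcut = sum_sq_about_mean(sum(x * x for x in sample), len(sample), mean)
print(direct, shortcut)
```

The subtraction involves two numbers of seven digits that differ only in the last five, which is why the text asks for the mean to be carried to extra places.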

As in the alfalfa problem discussed above, this total 
may be broken into an element representing variance 
between the monthly means and variance within the several 
months. (Reference here is to the columns of Table 126.) 
The variance between months may be computed directly 
from the monthly means. 

Thus: 

Sum of squares of deviations of monthly means from grand mean

$$= 10 \times (99.9867 - 90.60)^2 + 10 \times (99.9867 - 92.04)^2 + 10 \times (99.9867 - 96.38)^2 + \ldots + 10 \times (99.9867 - 88.35)^2.$$

That is, the deviation of each monthly mean from the
grand mean is squared and weighted by the number of
items represented by that mean; the sum of the twelve
figures thus obtained is the required measure of variability
between months.

An alternative shorter method may be employed in
determining the variance between months, utilizing the 
relationship 

$$\Sigma d^2 = \Sigma (d')^2 - Nc^2.$$

Here each $d'$ is the mean value for a given month. Each
squared value must be weighted by the number of items
represented by the mean. Thus

$$\Sigma f(d')^2 = 10(90.60)^2 + 10(92.04)^2 + 10(96.38)^2 + \ldots + 10(88.35)^2 = 1,207,068.40.$$

The correction factor, $Nc^2$, is the same as in the first
operation. We have, then,

Sum of squares of deviations of monthly means from
grand mean

$$= 1,207,068.40 - 1,199,680.82 = 7,387.58.$$

This sum measures that portion of the total variability 
that may be attributed to seasonal fluctuations. Is it 
significant or does it merely reflect the play of the mass 
of undifferentiated factors we call chance? 

In answering a similar question concerning the alfalfa 
problem we used as a yardstick the variability independent 
of the one factor the effects of which were being studied — 
namely, irrigation differences. In the present case we could 
obtain a measure that is independent of seasonality by 
computing the variability within the several columns of 
Table 126. That is, each item in col. (2) could be deducted 
from the January mean, 90.60, and the sum of the squared 
deviations in this column obtained; a similar sum could
be obtained for each of the other columns numbered from 
(3) to (13). The grand sum of these figures would be the 
variability within arrays — variability clearly independent 
of seasonal forces since only differences among items for 
the same month enter into it. This sum, measuring variability
within columns, has a value of 6,181.74. The
variability between columns plus the variability within
columns is, of course, equal to the total. That is,

$$7,387.58 + 6,181.74 = 13,569.32.$$
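
The identity between the total and its two parts holds for any table classified by columns. The following sketch, which uses a small made-up table rather than the freight car loadings data, computes the three quantities independently and compares them. All names and values in it are illustrative.

```python
def one_way_decomposition(columns):
    """Return (total, between, within) sums of squares for a list of columns."""
    items = [x for col in columns for x in col]
    grand_mean = sum(items) / len(items)
    col_means = [sum(col) / len(col) for col in columns]

    total = sum((x - grand_mean) ** 2 for x in items)
    # Each squared deviation of a column mean is weighted by the column size.
    between = sum(len(col) * (m - grand_mean) ** 2
                  for col, m in zip(columns, col_means))
    # Deviations of the items from the mean of their own column.
    within = sum((x - m) ** 2
                 for col, m in zip(columns, col_means) for x in col)
    return total, between, within


# Three hypothetical "months", three years each.
toy_columns = [[90.0, 94.0, 92.0], [101.0, 99.0, 103.0], [108.0, 110.0, 106.0]]
total, between, within = one_way_decomposition(toy_columns)
print(total, between, within)

# The corresponding figures quoted in the text.
text_total = 7387.58 + 6181.74
```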

The measure of variability within columns will not serve
in the present instance, however. The yardstick should be
a measure of the variability due to “chance,” that is, to the
play of a mass of random factors which may not be observed
and measured individually. Effects that can be clearly
attributed to specific causal forces should not be included
in the yardstick. But some of the variability within months
may be clearly assigned to changes associated with the
classification by years. The average of the 12 monthly
items for 1918 is 108.09; that for the 12 months in 1921
is o7.b3. The former was a year of prosperity, the latter 
one of depression. Clearly, some of the differences among
the items in the January column, or in the May column,
are definitely attributable to cyclical forces that raise all
the monthly figures for one year and depress all the monthly
figures for another year. (The influence of trend is not
present, since the items in the body of the table are actual
values expressed as percentages of trend.) The variability
within months should be corrected by the subtraction of
that portion of it that may be attributed to factors affecting
yearly conditions as a whole.

The influence of cyclical and other forces affecting whole
years may be measured by the variability between the averages for
1918, 1919, 1920, and the other years covered. These
averages are given in col. (15) of Table 126. The desired
quantity may be obtained by the precise methods used in
measuring the variability between months. We have

$$\Sigma d^2 = \Sigma (d')^2 - Nc^2$$

Sum of squared deviations of yearly averages from grand mean

$$= (12 \times 108.092^2 + 12 \times 98.542^2 + 12 \times 103.267^2 + \ldots) - 1,199,680.82 = 3,857.06.$$

(There will, of course, be ten squared items within
parentheses, one for each of the ten years covered by the data.)
Subtracting 3,857.06 from 6,181.74, the measure of total 
variability within the columns, we have 2,324.68 as the 
balance. This is the desired yardstick. It measures that 
portion of the variability among the original items which 
is clearly independent of the seasonal influence. Secondly, 
it has been corrected by the subtraction of that portion 
of the variability within months which is attributable to 
cyclical and other factors responsible for broad changes 
from year to year. The final balance represents the play 
of forces independent both of seasonal movements and of 
broad swings affecting each yearly value as a whole. This 
residual variability, measured by the figure 2,324.68, reflects
the play of all those random, undifferentiated factors we
lump together as chance.¹

This residual variability may be most readily computed by 
subtracting from the total variability the two figures measur- 
ing, respectively, variability between the means of the months 
and variability between the means of the years. At this stage 
of the computation these figures will be in the form of sums
of squared deviations. The form of organization employed in 
Table 126 on page 528 is convenient for these calculations. 
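
The subtraction just described can be sketched for any table classified two ways, with rows standing for years and columns for months. The code below is an illustration under that assumption; the small table is made up, and the function names are not from the text. It also checks that the residual found by subtraction equals the residual found directly from the items.

```python
def two_way_sums_of_squares(table):
    """table[i][j]: row i is a year, column j a month.

    Returns (total, between_rows, between_columns, residual), the residual
    being obtained by subtraction as described in the text.
    """
    n_rows, n_cols = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in table]
    col_means = [sum(row[j] for row in table) / n_rows for j in range(n_cols)]

    total = sum((x - grand) ** 2 for row in table for x in row)
    between_rows = n_cols * sum((m - grand) ** 2 for m in row_means)
    between_cols = n_rows * sum((m - grand) ** 2 for m in col_means)
    return total, between_rows, between_cols, total - between_rows - between_cols


def direct_residual(table):
    """Residual computed item by item, for comparison with the subtraction."""
    n_rows, n_cols = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in table]
    col_means = [sum(row[j] for row in table) / n_rows for j in range(n_cols)]
    return sum((table[i][j] - row_means[i] - col_means[j] + grand) ** 2
               for i in range(n_rows) for j in range(n_cols))


# Figures from the text: total, between months, between years.
text_residual = 13569.32 - 7387.58 - 3857.06
print(round(text_residual, 2))

# A made-up table of three years by four months.
toy = [[95.0, 102.0, 110.0, 99.0],
       [88.0, 97.0, 104.0, 93.0],
       [101.0, 108.0, 118.0, 103.0]]
total, years_ss, months_ss, residual = two_way_sums_of_squares(toy)
print(residual, direct_residual(toy))
```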

In the application of the test of significance, account 
must be taken of the number of degrees of freedom entering 
into each of these measures of variability. Table 127 indi- 
cates a suitable procedure. 

¹ When, as in the present example, the influences of the two variables, or
principles of classification, are independent, it is valid to use the residual
variability thus computed as a measure of the strength of random factors. If these
influences are not independent (if, in terms of the above example, seasonal
movements affecting the monthly averages and cyclical movements affecting
the annual averages should be correlated), the residual quantity will not
be an accurate measure of truly random factors. When the residual quantity
which is used as the yardstick in variance analysis is derived from observations
that are alike in respect of both principles of classification (i.e., when
the quantity measures variance within cells obtained by the application of a
two-fold principle of classification) this difficulty does not arise. An example
of this type is given in Appendix E.

Table 127

Analysis of Variance of Freight Car Loadings and Test of Seasonality

Nature of                    Degrees of   Sum of        Mean square   Natural logarithm
variability                  freedom      squares       (variance)    of mean square

Between means of years            9        3,857.06      428.562
Between means of months          11        7,387.58      671.598       6.50970
Residual variability             99        2,324.68       23.482       3.15627
Total                           119       13,569.32

Difference = 3.35343
z = 3.35343 / 2 = 1.67671

The item 3,857.06 measures the degree of difference
between 10 yearly averages. Nine degrees of freedom are
represented in this figure. (The use of weights in computing
the sum of the squares does not affect the number of degrees
of freedom.) Similarly, 11 degrees of freedom are represented
in the measure of variability between the 12 monthly
means. The total variability is computed from 120 items,
and 119 degrees of freedom are represented in it. The number
of degrees of freedom in the residual variability is, therefore,
119 - 9 - 11, or 99.

The variance between the means of months (i.e., the
mean square) is 671.598. The residual variance is 23.482.
The test of seasonality reduces to the question: May the
variance between the means of months be attributed to
the same forces as are responsible for the residual variance?
Unless the variance between the monthly means is significantly
greater than the residual variance, no significance
can be attached to the apparent seasonal pattern.

The test is applied with reference to the measure z,
which is equal to half the difference between the natural
logarithms of the two variances being compared. From
the entries in Table 127 we compute z as equal to 1.67671.
Referring to Appendix Table VI we find that for $n_1 = 11$
and $n_2 = 99$, the 1 per cent value of z is approximately .44.
The present value is distinctly greater than this. The
results are not consistent with the hypothesis that the
true value of z is zero. There is clear evidence of the existence
of a definite seasonal pattern in freight car loadings.¹
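
The steps leading to z may be sketched as follows. The sums of squares and degrees of freedom are those given in the text; the 1 per cent value of .44 is copied from the text rather than computed, and the dictionary layout and function name are illustrative assumptions.

```python
import math


def fisher_z(larger_variance, smaller_variance):
    """Half the difference between the natural logarithms of two variances."""
    return 0.5 * (math.log(larger_variance) - math.log(smaller_variance))


# Sums of squares and degrees of freedom for the freight car loadings data.
sums_of_squares = {"years": 3857.06, "months": 7387.58, "residual": 2324.68}
degrees_of_freedom = {"years": 10 - 1, "months": 12 - 1}
degrees_of_freedom["residual"] = (120 - 1) - 9 - 11

# Mean square (variance) = sum of squares / degrees of freedom.
mean_squares = {name: sums_of_squares[name] / degrees_of_freedom[name]
                for name in sums_of_squares}

z_months = fisher_z(mean_squares["months"], mean_squares["residual"])
ONE_PER_CENT_Z = 0.44  # approximate tabled value for n1 = 11, n2 = 99, from the text

print({name: round(ms, 3) for name, ms in mean_squares.items()})
print(round(z_months, 4), z_months > ONE_PER_CENT_Z)
```

Since z is half the logarithm of the variance ratio, the same comparison could be stated in terms of the ratio itself, which equals `math.exp(2 * z_months)`.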

The same yardstick may be applied in testing whether 
the differences between the yearly averages are significant. 
The rather wide variations from year to year in the average 
values of the items in the body of Table 126 represent, 
presumably, the play of cyclical forces plus major “acci- 
dental fluctuations” affecting yearly totals. (The trend 
factor, had it not been eliminated, would have combined 
with these other two to create differences among the yearly 
totals.) But are these year-to-year differences great enough 
to be attributed to definite forces other than the chance 
factors represented in the residual variance we are using 
as yardstick? 

The variance between means of years is equal to 
3,857.06 ÷ 9, or 428.562. Is this significantly greater than
23.482, the residual variance? Following the procedure
illustrated in Table 127 we obtain 1.35352 as the value
of z. The 1 per cent value of z, for $n_1 = 9$, $n_2 = 99$, is

approximately .47. The test indicates that the differences 
between the annual averages are due to definite forces 
other than the random factors represented in the residual 
variance. 

¹ In the test here applied we are proceeding on the assumption that the
seasonal pattern is constant from year to year. If it is not constant, the
accuracy of the residual variability, as a measure of “chance” factors, and of
the measure of variability between months will be affected, and the
significance of the results will be lessened. If there is reason to believe that seasonal
movements have changed over the period covered, tests of the kind suggested
in Chapter VIII should precede the tests here discussed.





REFERENCES

Fisher, R. A., Statistical Methods for Research Workers. Chan s 
Fisher, R. A., The Design of Experiments. Chap 4 
Friedman, Milton, “The Use of Ranks to Avoid the Assumption
of Normality Implicit in the Analysis of Variance,” Journal
of the American Statistical Association, Dec. 1937.

Schultz, T. W., and Snedecor, G. W., “Analysis of Variance as
an Effective Method of Handling the Time Element in Certain
Economic Statistics,” Journal of the American Statistical
Association, March 1933.

Snedecor, G. W., Statistical Methods. Chaps. 10-12, 15.

Tippett, L. H. C., The Methods of Statistics. Chap. 6.

Sap! if <« Thmm of 


CHAPTER XVI 


THE MEASUREMENT OF RELATIONSHIP: 
MULTIPLE AND PARTIAL CORRELATION 

In dealing with methods of defining correlation in the 
preceding chapters we have been concerned with problems 
involving only two variables, a dependent variable and a 
single independent variable. We have found, in certain 
cases, a fairly high degree of correlation between the two 
variables studied. But it is obvious that, in general, 
economic phenomena are affected by more than one factor, 
that the fluctuations in a single variable may be due to 
the interaction of many forces. In dealing with just two 
variables all other factors are ignored, on the assumption, 
usually, that in the single independent variable are found 
the most important causes* of fluctuations in the dependent 
variable. Thus, in the alfalfa example given, the effect 
upon yield of but a single factor, irrigation, was studied. 
Yet variations in rainfall and temperature must have 
affected the yield in the different years studied. Similarly, 
variations in practically every factor dealt with in economic 
analysis are traceable to more than one cause. If our 
analysis is to be complete we must employ methods which 
will enable more than two variables to be handled at a 
time. We need instruments that will assist us in measuring 
the combined effect upon a single variable of a number 
of factors. Such instruments may be secured by a simple 
extension of methods already familiar. 

In Table 128 are presented figures showing the yield
of corn, per acre, in Kansas from 1890 to 1933, together
with the average June, July, and August temperatures for
each of these years.

* This should not be taken to mean that the coefficient of correlation measures
or establishes causal relationships.

Table 128

Corn Yield and Temperature in Kansas, 1890-1933¹

(1)      (2)              (3)             (4)             (5)
Year     Average yield    Average June    Average July    Average August
         per acre, in     temperature     temperature     temperature
         bushels
         X1               X2              X3              X4

1890     15.6             77.6            83.1            76.1
1891     26.7             70.7            74.0            75.1
1892     24.5             73.4            77.5            76.5
1893     21.3             74.7            79.5            73.8
1894     11.2             74.2            77.8            78.0
1895     24.3             71.7            74.9            76.0
1896     28.0             74.1            78.1            78.7
1897     18.0             76.6            80.2            76.0
1898     16.0             75.0            77.7            78.2
1899     27.0             73.9            76.2            80.6
1900     19.0             74.9            77.9            81.0
1901      7.8             77.3            85.0            79.1
1902     29.9             70.9            76.8            78.2
1903     25.6             67.2            78.3            75.3
1904     20.9             70.4            75.6            74.6
1905     27.7             75.5            74.5            78.7
1906     28.9             71.8            73.8            76.3
1907     22.1             72.0            78.4            78.1
1908     22.0             72.1            75.8            76.2
1909     19.9             73.1            78.1            80.1
1910     19.0             72.2            79.5            75.7
1911     14.5             80.5            78.6            76.4
1912     23.0             69.3            79.9            77.4
1913      3.2             74.2            82.1            84.2
1914     18.5             78.2            79.9            78.2
1915     31.0             69.2            74.0            70.1
1916     10.0             70.3            81.2            79.6
1917     13.0             72.8            80.8            73.4
1918      7.1             78.4            78.3            82.3
1919     15.2             72.3            80.2            78.3
1920     26.5             72.8            77.6            72.9
1921     22.2             74.4            79.2            78.6
1922     19.3             75.2            77.0            80.1
1923     21.7             73.3            79.4            78.3
1924     21.7             74.3            75.1            79.0
1925     16.6             77.7            79.7            77.4
1926     11.0             72.5            78.4            79.1
1927     30.0             70.9            76.9            73.1
1928     27.0             67.7            78.1            77.1
1929     17.5             72.2            78.8            78.9
1930     12.0             73.1            81.7            80.3
1931     17.5             78.1            80.6            76.1
1932     18.5             74.3            81.8            79.2
1933     11.5             80.5            81.4            76.8

¹ ... reports of the U. S. ..., and from the ... Bureau for Dodge City ...

The Relation between Corn Yield and Temperature:
Preliminary Analysis

It is known that corn yield is affected by the temperature 
during the growing season. The object of the present 
study is the determination of the precise relation between 
yield and temperature during each of the three months 
given, in order to secure a basis for estimating the yield 
from a knowledge of the temperature. As certain growing 
months are more important than others, the relation 
between temperature and yield may be determined, first, 
for each of the three months separately. 

The equation which describes the relationship between
yield per acre and June temperature will be of the type

$$X_1 = a + b_{12}X_2.$$

The equation describing the relationship between yield per
acre and July temperature will be of the type

$$X_1 = a + b_{13}X_3.$$

(In each case $X_1$ represents average corn yield per acre,
for the State, while $X_2$, $X_3$, etc., represent the absolute
temperature, in degrees Fahrenheit.) Instead of using to
represent the variables the symbols $Y$ and $X$, as in the
preceding examples, $X_1$, $X_2$, $X_3$, etc., are employed, $X_1$
representing in this case the dependent variable. The
symbol for the constant representing the slope (the coefficient
of regression) is, in the first instance above, $b_{12}$. The
subscripts 1 and 2 indicate the variables to which this
constant refers, the first subscript always representing the
dependent variable ($X_1$ in the example cited), the second
the independent variable ($X_2$ in the illustration above).
These subscripts are necessary to distinguish the different
constants when several variables enter into the problem.
The meaning is precisely the same as in the former examples
when no subscripts were needed because only two variables
were dealt with.

Solving the proper normal equations for the constants 
in the equation which describes the average relationship 
between yield per acre and June temperature, we have 

$$X_1 = 100.35 - 1.096X_2.$$

The value of $S_{12}$ may be determined from the formula

$$S_{12}^2 = \frac{\Sigma(X_1^2) - a\Sigma(X_1) - b_{12}\Sigma(X_1X_2)}{N}.$$

(The subscripts to $S$, and those to $r$ which appear below,
have the same meaning as those employed in the preceding
paragraph.) Substituting the given values, and solving,
we have

$$S_{12}^2 = 33.593$$

and

$$S_{12} = 5.80.$$

The significance of the standard error, $S$, as a measure
of the reliability of estimates based upon the equation of
relationship, has been fully explained. In judging of the
usefulness of the equation, $S_{12}$ should be compared with $\sigma_1$
(the standard deviation of $X_1$), which may be looked upon
as a measure of the reliability of estimates based upon the
arithmetic mean of the variable $X_1$. For this we have

$$\sigma_1 = 6.68.$$

Clearly, the estimates from the equation are more reliable
than those based upon the mean. The coefficient of
correlation, $r$, expresses this relationship in abstract terms. We
may get this value from the equation

$$r_{12}^2 = \frac{a\Sigma(X_1) + b_{12}\Sigma(X_1X_2) - Nc_1^2}{\Sigma(X_1^2) - Nc_1^2}.$$

Solving for $r$, and giving it the sign of $b_{12}$, we have

$$r_{12} = -.4984.$$

These values indicate a negative correlation, though not
a high one, between yield per acre of corn and June
temperature in Kansas. Let us see if the estimates could be
improved if based upon the temperature in July instead
of in June.

The values needed in this study may be computed from 
Table 128. Solving for the constants in the equation of 
regression, we secure the equation 

$$X_1 = 166.07 - 1.866X_3.$$

For the standard error, we have

$$S_{13} = 4.81$$

and for the coefficient of correlation

$$r_{13} = -.6948.$$

We have here a closer relation and a better basis for 
estimates than in the case when June temperature was 
considered. 

Repeating the process for yield per acre and August 
temperature, we have 

$$X_1 = 119.45 - 1.288X_4$$

$$S_{14} = 5.78$$

$$r_{14} = -.5013.$$
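
Each of the three single-variable results above follows from five quantities: the two means, the two variances, and the mean product of the deviations. The sketch below shows the arithmetic for July temperature, using the means, variances, and mean product tabulated later in this chapter. The function name and argument names are illustrative, and the results agree with the printed figures only to within the rounding of a hand computation.

```python
import math


def simple_regression(mean_y, mean_x, var_y, var_x, mean_product):
    """Return (a, b, r, s) for the line y = a + b*x fitted by least squares.

    var_y, var_x   squared standard deviations of y and x
    mean_product   mean product of the deviations of x and y from their means
    """
    b = mean_product / var_x                      # coefficient of regression
    a = mean_y - b * mean_x                       # constant term
    r = mean_product / math.sqrt(var_y * var_x)   # coefficient of correlation
    s = math.sqrt(var_y * (1.0 - r * r))          # standard error of estimate
    return a, b, r, s


# Yield (X1) on July temperature (X3): means, variances, and mean product.
a13, b13, r13, s13 = simple_regression(
    mean_y=19.6341, mean_x=78.4864,
    var_y=44.6878, var_x=6.2013, mean_product=-11.5671,
)
print(round(a13, 2), round(b13, 3), round(r13, 4), round(s13, 2))
```

The standard error is written here as the standard deviation of yield reduced by the factor $\sqrt{1 - r^2}$, which is the comparison between $S$ and $\sigma_1$ drawn in the text.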

August temperature, it is evident, also affects the corn 
yield in Kansas, a low temperature conducing to yield 
above normal. The relationship is not so close as in the 
case of July temperature, but it is still significant. What 
is needed now is some method of combining these three 
factors, in order that an estimate may be based upon a 
knowledge of their influence, in combination, upon the 
yield of corn. The addition or averaging of the temperatures
in the three months will not do, for July is obviously more
important than either of the other months. The principle
of the method by which this may be accomplished is simple.

The Estimation of Corn Yield from Three
Independent Variables

The estimating or regression equation in the present case
will be one in which there is a single dependent variable
(corn yield) and three independent variables. It will be
of the form

$$X_1 = a + b_{12.34}X_2 + b_{13.24}X_3 + b_{14.23}X_4.$$

If we can determine the values of the four constants, we 
may substitute given values of X2, X3, and X4 in the equa- 
tion and thus get an estimate for Xi in precisely the same 
way as when two variables are dealt with. The method 
of least squares affords the means of solving for the required 
constants. 

The symbols require a word of explanation, as a perfectly
simple equation is given a rather ponderous appearance
by all the subscripts employed. The symbol $b_{12}$, it has been
explained, represents the coefficient of regression of $X_1$ on
$X_2$ (i.e., the slope of the line describing their relationship, $X_1$
being dependent) when these two variables alone are
included in the study. The symbol $b_{12.34}$ represents the
coefficient of net regression of $X_1$ on $X_2$. The addition of the
subscripts 3 and 4 to the right of the period means, simply,
that the variables $X_3$ and $X_4$ have been included in the
study and the effects of their variations eliminated, in so
far as this one constant ($b_{12.34}$) is concerned. This constant
measures the weight which must be given to the variable
$X_2$ in an estimate of $X_1$ based upon the three independent
variables, $X_2$, $X_3$, and $X_4$. It will not, of course, be the
same as $b_{12}$, which indicates the weight given to $X_2$ when
an estimate of $X_1$ is based upon $X_2$ alone. Similarly the
constant $b_{13.24}$, the coefficient of net regression of $X_1$ on $X_3$,
measures the weight given to $X_3$ when $X_2$ and $X_4$ are also
included. Each coefficient represents a single, simple constant,
but the subscripts are necessary in order that the
precise meaning of this constant may be clear. The subscripts
to the left of the period are termed primary subscripts,
those to the right secondary subscripts.

FORMATION AND SOLUTION OF THE NORMAL EQUATIONS 

The first task¹ is the securing of the normal equations
required in solving for the constants in the estimating
equation given above. Following the usual procedure² we
have:

I   $\Sigma(X_1) = Na + b_{12.34}\Sigma(X_2) + b_{13.24}\Sigma(X_3) + b_{14.23}\Sigma(X_4)$

II  $\Sigma(X_1X_2) = a\Sigma(X_2) + b_{12.34}\Sigma(X_2^2) + b_{13.24}\Sigma(X_2X_3) + b_{14.23}\Sigma(X_2X_4)$

III $\Sigma(X_1X_3) = a\Sigma(X_3) + b_{12.34}\Sigma(X_2X_3) + b_{13.24}\Sigma(X_3^2) + b_{14.23}\Sigma(X_3X_4)$

IV  $\Sigma(X_1X_4) = a\Sigma(X_4) + b_{12.34}\Sigma(X_2X_4) + b_{13.24}\Sigma(X_3X_4) + b_{14.23}\Sigma(X_4^2)$.

The given values might be substituted in these simul- 
taneous equations and solutions secured directly for the 
four constants. It is possible to reduce the number of 
normal equations by one, however, and thus lessen mate- 
rially the labor of computation. This is done by using 
deviations from the arithmetic mean for each variable 
instead of absolute values, getting rid in this way of the 
constant term a in the original equation. 
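
The effect of working with deviations can be seen in the simplest case of one independent variable. The sketch below fits the same line twice, once from the normal equations in original units and once from deviations about the means, and the two routes give the same constants. The five pairs of values are the first five years of Table 128 (June temperature and yield); the function names are illustrative assumptions.

```python
def fit_from_original_values(xs, ys):
    """Solve the two normal equations in original units for a and b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


def fit_from_deviations(xs, ys):
    """Use deviations from the means: one equation for b, then recover a."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    mean_product = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    variance_x = sum((x - mean_x) ** 2 for x in xs) / n
    b = mean_product / variance_x   # the constant term has dropped out
    a = mean_y - b * mean_x         # re-introduced when original units are wanted
    return a, b


june_temperature = [77.6, 70.7, 73.4, 74.7, 74.2]   # 1890-1894
corn_yield = [15.6, 26.7, 24.5, 21.3, 11.2]

a_raw, b_raw = fit_from_original_values(june_temperature, corn_yield)
a_dev, b_dev = fit_from_deviations(june_temperature, corn_yield)
print(a_raw, b_raw)
print(a_dev, b_dev)
```

With three independent variables the same step removes one of the four normal equations, which is the saving in labor the text refers to.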

If we let $M_1$, $M_2$, $M_3$, etc., represent the arithmetic means
of the different variables while $x_1$, $x_2$, $x_3$, etc., represent
deviations from the means, we may replace the absolute
numbers $X_1$, $X_2$, $X_3$, etc., by their equivalents, $x_1 + M_1$,
$x_2 + M_2$, $x_3 + M_3$, etc. Making these substitutions in the
normal equations, certain algebraic simplifications are
possible which eliminate the first of the normal equations,
and reduce the others to the following form:

$$\frac{\Sigma(x_1x_2)}{N} = \frac{\Sigma(x_2^2)}{N} b_{12.34} + \frac{\Sigma(x_2x_3)}{N} b_{13.24} + \frac{\Sigma(x_2x_4)}{N} b_{14.23}$$

$$\frac{\Sigma(x_1x_3)}{N} = \frac{\Sigma(x_2x_3)}{N} b_{12.34} + \frac{\Sigma(x_3^2)}{N} b_{13.24} + \frac{\Sigma(x_3x_4)}{N} b_{14.23}$$

$$\frac{\Sigma(x_1x_4)}{N} = \frac{\Sigma(x_2x_4)}{N} b_{12.34} + \frac{\Sigma(x_3x_4)}{N} b_{13.24} + \frac{\Sigma(x_4^2)}{N} b_{14.23}.$$

¹ The approach to the problem of multiple correlation which is here taken
follows that of H. R. Tolley and M. J. B. Ezekiel, “A Method of Handling
Multiple Correlation Problems,” Journal of the American Statistical Association,
December, 1923, 993-1003.

² Cf. Appendix A for a discussion of this procedure and of the methods
employed in simplifying the normal equations.


All the variables in the above equations refer to deviations
from the respective arithmetic means. Therefore
$\Sigma(x_1x_2)/N$ is simply the mean product of the variables $x_1$ and $x_2$,
$\Sigma(x_2^2)/N$ is $\sigma_2^2$, etc. Representing the various mean products
by the symbols $p_{12}$, $p_{13}$, etc., and inserting the symbols
for the squares of the standard deviations, we secure, for
the normal equations:

$$p_{12} = \sigma_2^2 b_{12.34} + p_{23} b_{13.24} + p_{24} b_{14.23}$$

$$p_{13} = p_{23} b_{12.34} + \sigma_3^2 b_{13.24} + p_{34} b_{14.23}$$

$$p_{14} = p_{24} b_{12.34} + p_{34} b_{13.24} + \sigma_4^2 b_{14.23}.$$

This is the most convenient form for the solution of the 
normal equations. 

From the data, as arranged in Table 128, the following
values are derived:

$\Sigma(X_1) = 863.9$          $\Sigma(X_1^2) = 18,928.17$
$\Sigma(X_2) = 3,241.5$        $\Sigma(X_2^2) = 239,209.57$
$\Sigma(X_3) = 3,453.4$        $\Sigma(X_3^2) = 271,317.92$
$\Sigma(X_4) = 3,409.1$        $\Sigma(X_4^2) = 264,433.19$

$\Sigma(X_1X_2) = 63,198.42$
$\Sigma(X_1X_3) = 67,295.48$
$\Sigma(X_1X_4) = 66,550.84$

$c_1 = \Sigma(X_1)/N = 19.6341$     $c_1^2 = 385.4979$
$c_2 = 73.6705$                      $c_2^2 = 5,427.3426$
$c_3 = 78.4864$                      $c_3^2 = 6,160.1150$
$c_4 = 77.4795$                      $c_4^2 = 6,003.0729$

From these values, the quantities necessary for the solution
of the normal equations may be readily determined.
These quantities are brought together below:

$$\sigma_1^2 = \frac{\Sigma(X_1^2)}{N} - c_1^2 = \frac{18,928.17}{44} - 385.4979 = 44.6878$$

$$\sigma_2^2 = \frac{239,209.57}{44} - 5,427.3426 = 9.2385$$

$$\sigma_3^2 = \frac{271,317.92}{44} - 6,160.1150 = 6.2013$$

$$\sigma_4^2 = \frac{264,433.19}{44} - 6,003.0729 = 6.7723$$

$$p_{12} = \frac{\Sigma(X_1X_2)}{N} - c_1c_2 = \frac{63,198.42}{44} - 1,446.45396 = -10.1263$$

$$p_{13} = \frac{67,295.48}{44} - 1,541.0098 = -11.5671$$

$$p_{14} = \frac{66,550.84}{44} - 1,521.2403 = -8.7213$$

$$p_{23} = \frac{254,544.98}{44} - 5,782.1323 = 2.9808$$

$$p_{24} = \frac{251,246.54}{44} - 5,707.9535 = 2.1951$$

$$p_{34} = \frac{\Sigma(X_3X_4)}{44} - 6,081.0870 = 1.8586.$$

Substituting in the normal equations, we have:

$$-10.1263 = 9.2385 b_{12.34} + 2.9808 b_{13.24} + 2.1951 b_{14.23}$$

$$-11.5671 = 2.9808 b_{12.34} + 6.2013 b_{13.24} + 1.8586 b_{14.23}$$

$$-8.7213 = 2.1951 b_{12.34} + 1.8586 b_{13.24} + 6.7723 b_{14.23}.$$

Solving these simultaneous equations we secure the
following values for the constants:

$$b_{12.34} = -0.460 \qquad b_{13.24} = -1.420 \qquad b_{14.23} = -0.749.$$

The required equation is, therefore, 

xi =- 0 . 460 a ;2 - 1 . 420 a :8 r- 0.7493:4. 
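The solution of the three normal equations can be checked mechanically. The following is a minimal sketch, not the procedure used in the text: it applies ordinary Gaussian elimination with partial pivoting to the coefficients and mean products given above. The function name `solve_linear` and the variable names are illustrative assumptions.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination (partial pivoting)."""
    n = len(b)
    # Build an augmented matrix from copies so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Bring the row with the largest entry in this column to the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


# Variances and mean products of X2, X3, X4, as substituted in the normal equations.
coefficients = [
    [9.2385, 2.9808, 2.1951],
    [2.9808, 6.2013, 1.8586],
    [2.1951, 1.8586, 6.7723],
]
# Mean products p12, p13, p14.
mean_products = [-10.1263, -11.5671, -8.7213]

b12_34, b13_24, b14_23 = solve_linear(coefficients, mean_products)
print(round(b12_34, 3), round(b13_24, 3), round(b14_23, 3))
```

Because the matrix of coefficients is symmetric, any standard elimination routine would serve equally well; the hand method of the text arrives at the same constants.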

This is the equation of regression of $x_1$ on $x_2$, $x_3$, and $x_4$.
Any given values of the three independent variables (June
temperature, July temperature, and August temperature)
may be substituted in this equation, and the most probable
value of the dependent variable (corn yield per acre)
determined. In the equation as it stands, it should be noted,
all the variables are expressed as deviations from their
respective arithmetic means. For practical purposes it is
advisable to have an equation in terms of the original values.
In other words, it is desirable to shift the origin from the
point of averages to the zero point on the original scales.
This necessitates re-introducing the constant term $a$.
The value of $a$ may be determined from the equation

$$A_1 = a + A_2 b_{12.34} + A_3 b_{13.24} + A_4 b_{14.23}$$

where the $A$'s represent the respective arithmetic means.
(This relation follows on replacing the absolute numbers $X_1$,
$X_2$, etc., by their equivalents $x_1 + A_1$, $x_2 + A_2$, etc.,
which gives

$$\Sigma(x_1) + NA_1 = Na + b_{12.34}[\Sigma(x_2) + NA_2] + b_{13.24}[\Sigma(x_3) + NA_3] + b_{14.23}[\Sigma(x_4) + NA_4],$$

in which each sum of deviations from the mean is zero.)
Inserting the proper values, we have

$$19.6341 = a + 73.6705(-0.46005) + 78.4864(-1.41967) + 77.4795(-0.74910).^{1}$$

Solving,

$$a = 222.99.$$

The equation of regression in terms of original values is,
therefore,

$$X_1 = 222.99 - 0.460X_2 - 1.420X_3 - 0.749X_4.$$
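The shift of origin can be verified in a few lines. This is a small sketch under the figures quoted above; the dictionary names `means` and `slopes` are illustrative assumptions, and the coefficients are carried to five decimal places, as in the text's own substitution.

```python
# Arithmetic means of the four series (A1 ... A4 in the text).
means = {"X1": 19.6341, "X2": 73.6705, "X3": 78.4864, "X4": 77.4795}

# Coefficients of net regression, to five decimals.
slopes = {"X2": -0.46005, "X3": -1.41967, "X4": -0.74910}

# A1 = a + A2*b12.34 + A3*b13.24 + A4*b14.23, solved for a.
a = means["X1"] - sum(slopes[name] * means[name] for name in slopes)
print(round(a, 2))
```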

COMPUTATION OF THE STANDARD ERROR OF ESTIMATE

Are estimates based upon this equation any more reliable
than those based upon the equations previously derived,
each of which referred to a single independent variable?
To answer this question the value of the standard error
must be computed. This will be represented in the present
case by $S_{1.234}$, the subscripts referring to the single dependent
variable ($X_1$) and the three independent variables. This
value may be computed from the formula²

$$S^2_{1.234} = \sigma_1^2 - b_{12.34}p_{12} - b_{13.24}p_{13} - b_{14.23}p_{14}.$$

Substituting the proper values, we have

$$\begin{aligned}
S^2_{1.234} &= 44.6878 - 4.6586 - 16.4215 - 6.5331 \\
&= 17.0746 \\
S_{1.234} &= 4.13.^{*}
\end{aligned}$$

This is to be interpreted just as the standard error of estimate
was interpreted in previous cases. The reliability of estimates
based upon the mean value of $X_1$ is measured by $\sigma_1$, which
has a value of 6.68. The reliability of estimates based
upon the equation of net regression, when yield is considered
as a function of temperature in June, July, and August, is
measured by $S_{1.234}$, which has a value of 4.13. It is clear
that estimates made from the equation are distinctly more
reliable than those based upon a knowledge of $X_1$ alone.
We have by no means accounted for all the factors that
are responsible for variability in corn yield, but we have
measured and reduced to precise terms the effect of three
factors upon the yield of corn per acre in Kansas.

¹ The arbitrary origin is at zero on each of the original scales, hence $A_1 = c_1$,
$A_2 = c_2$, etc. To ensure greater accuracy in solving for $a$, the values of the
coefficients $b_{12.34}$, $b_{13.24}$, etc., are given to a greater number of decimal places than
in the equation of regression.

² This formula may be derived as follows: Given an equation of the type

$$x_1 = b_{12.34}x_2 + b_{13.24}x_3 + b_{14.23}x_4$$

(in which the variables refer to deviations from the means) each residual may
be computed from the equation

$$d = b_{12.34}x_2 + b_{13.24}x_3 + b_{14.23}x_4 - x_1. \qquad (1)$$

Multiplying throughout by $d$, and adding, we have

$$\Sigma(d^2) = b_{12.34}\Sigma(dx_2) + b_{13.24}\Sigma(dx_3) + b_{14.23}\Sigma(dx_4) - \Sigma(dx_1)$$

but it follows from the method of fitting that

$$\Sigma(dx_2) = \Sigma(dx_3) = \Sigma(dx_4) = 0$$

and, therefore,

$$\Sigma(d^2) = -\Sigma(dx_1). \qquad (2)$$

Multiplying each residual equation (1) by $x_1$ and adding, we have

$$\Sigma(dx_1) = b_{12.34}\Sigma(x_1x_2) + b_{13.24}\Sigma(x_1x_3) + b_{14.23}\Sigma(x_1x_4) - \Sigma(x_1^2).$$

Substituting the equivalent of $\Sigma(dx_1)$ in equation (2) and dividing by $N$, we secure

$$\frac{\Sigma(d^2)}{N} = \frac{\Sigma(x_1^2)}{N} - b_{12.34}\frac{\Sigma(x_1x_2)}{N} - b_{13.24}\frac{\Sigma(x_1x_3)}{N} - b_{14.23}\frac{\Sigma(x_1x_4)}{N}.$$

Since the variables refer to deviations from the means, we have

$$S^2_{1.234} = \sigma_1^2 - b_{12.34}p_{12} - b_{13.24}p_{13} - b_{14.23}p_{14}.$$

See Appendix A for a general derivation of these relations.

* For precise work, when the sample is small, allowance should be made in
computing $S$ for the number of constants in the equation of regression. Since
there are four constants in the present equation, the 44 observations have but
40 degrees of freedom to deviate from the computed values. Denoting by $\bar{S}$
the corrected value of the standard error of estimate, and by $m$ the number of
constants in the equation of regression, Ezekiel gives

$$\bar{S}^2 = S^2\left(\frac{N}{N - m}\right).$$

Applying this correction to the present measurements, we have

$$\bar{S}^2_{1.234} = 17.0746\left(\frac{44}{40}\right) = 18.7821, \qquad \bar{S}_{1.234} = 4.33.$$

THE COEFFICIENT OF MULTIPLE CORRELATION

We have need now of our third measure, the abstract
coefficient of correlation. The value of this coefficient, as
we have seen, depends upon the relation between $S$ and $\sigma$.
It may be computed in the present instance from the
formula

$$R^2_{1.234} = 1 - \frac{S^2_{1.234}}{\sigma_1^2}.$$

When the relationship between a single dependent variable
and several independent variables is being studied, this
measure is termed the coefficient of multiple correlation
and is represented by the symbol $R$. The subscript to
the left of the period relates to the dependent variable,
while those to the right relate to the independent variables.
Substituting in this formula the equivalent of $S^2_{1.234}$, we have

$$R^2_{1.234} = 1 - \frac{\sigma_1^2 - b_{12.34}p_{12} - b_{13.24}p_{13} - b_{14.23}p_{14}}{\sigma_1^2}$$

which reduces to¹

$$R^2_{1.234} = \frac{b_{12.34}p_{12} + b_{13.24}p_{13} + b_{14.23}p_{14}}{\sigma_1^2}.$$

Inserting the proper values we have

$$\begin{aligned}
R^2_{1.234} &= \frac{4.6586 + 16.4215 + 6.5331}{44.6878} \\
R^2_{1.234} &= .6179 \\
R_{1.234} &= .786.
\end{aligned}$$

For the same reason that estimates of $S$ computed from
samples must be corrected by making allowance for the
number of constants in the regression equation, correction
must be made in $R$. For if the number of constants is
equal to the number of observations, $R$ will necessarily
equal 1. Using $\bar{R}$ to denote the corrected coefficient of
multiple correlation and $m$ to denote the number of constants
in the equation of regression, Ezekiel gives

$$\bar{R}^2 = 1 - (1 - R^2)\left(\frac{N - 1}{N - m}\right).$$

In the present example

$$\begin{aligned}
\bar{R}^2 &= 1 - \left\{(1 - .6179)\left(\frac{43}{40}\right)\right\} \\
&= .5892 \\
\bar{R} &= .768.
\end{aligned}$$

In later references to this illustration the uncorrected
measure is used, though it is to be understood that the
corrected measure provides a somewhat closer approximation
to the true $R$ than does the uncorrected coefficient.

¹ The coefficient of multiple correlation may also be derived from the general
formula, which refers to an origin at zero on the original scales. This general
formula is

$$R^2_{1.234 \ldots n} = \frac{a\,\Sigma(X_1) + b_{12.34 \ldots n}\Sigma(X_1X_2) + b_{13.24 \ldots n}\Sigma(X_1X_3) + b_{14.23 \ldots n}\Sigma(X_1X_4) + \cdots - Nc_1^2}{\Sigma(X_1^2) - Nc_1^2}.$$
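The standard error of estimate, the coefficient of multiple correlation, and the corrections for the number of constants all follow from the same few quantities. The sketch below simply repeats that arithmetic; the variable names are illustrative assumptions, and the corrected values use the adjustments attributed to Ezekiel, namely N/(N - m) for the squared standard error and (N - 1)/(N - m) for 1 - R².

```python
import math

N, m = 44, 4                          # observations; constants in the equation
var_x1 = 44.6878                      # sigma_1 squared
b = [-0.46005, -1.41967, -0.74910]    # b12.34, b13.24, b14.23
p = [-10.1263, -11.5671, -8.7213]     # p12, p13, p14

# Portion of the variance of X1 accounted for by the regression.
explained = sum(bi * pi for bi, pi in zip(b, p))

s2 = var_x1 - explained               # S^2 1.234
s = math.sqrt(s2)                     # standard error of estimate
r2 = explained / var_x1               # R^2 1.234
r = math.sqrt(r2)                     # R is reported without sign

# Allowance for the number of constants.
s2_corrected = s2 * N / (N - m)
r2_corrected = 1 - (1 - r2) * (N - 1) / (N - m)

print(round(s2, 4), round(s, 2), round(r2, 4), round(r, 3))
print(round(s2_corrected, 4), round(r2_corrected, 4))
```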

The coefficient of multiple correlation is an index of the
degree of relationship between a single dependent variable
and a number of independent variables, in combination.
It measures the degree to which variations in the dependent
variable are related to the combined action of the other
factors. Its significance may be clearer if all the independent
variables are looked upon as constituting a single independent
series. The coefficient is then seen to be a measure
of the relationship between the dependent variable and the
independent series, which is precisely what the coefficient
of correlation is in the simpler case of two variables. In
the multiple case the independent series has several component
elements, but this fact does not alter the essential
significance of the coefficient. No positive or negative sign
is attached to $R$, it should be noted. In the present instance
all of the independent variables are negatively correlated
with corn yield, and a negative sign might be attached.
The correlation could be positive, however, for some of
the independent variables, and negative for others. Because
of this fact, $R$ is always given without sign. The signs of
the constants in the equation of net regression show which
of the independent variables are positively correlated and
which are negatively correlated with the dependent variable.

The sampling error of the coefficient of multiple correlation
may be estimated from the formula

$$\sigma_R = \frac{1 - R^2}{\sqrt{N - m}}$$

where $m$ is the number of constants in the equation of
regression. A more accurate test of the significance of $R$
may be applied with reference to Fisher's z-table, discussed
in Chapter XV. The deviations of actual from computed
values serve as a yardstick for testing the variability in
$X_1$ that is attributable to $X_2$, $X_3$, and $X_4$, as the relationship
is defined by the equation of regression. In common with
other correlation problems, this one reduces to a comparison
of variances.

The sum of the squares of the deviations of the observed
values of $X_1$ from the computed values is 751.2824. The
sum of the squares of the deviations of the computed
values of $X_1$ from the mean value of $X_1$ is 1,214.9808.
Since there are 44 observations, and since the equation
of regression contains four constants, there are 40 degrees
of freedom in the deviations from the regression function.
The three coefficients of regression (other than the constant
$a$) give three degrees of freedom to variation among the
computed values of $X_1$. The test takes the following form.


Nature of variability            Degrees of   Sum of squared   Mean        Log_e σ²
                                 freedom      deviations       square

Variation among computed
  values                              3        1,214.9808      404.9936     6.0039
Deviation of observed from
  computed values                    40          751.2824       18.7821     2.9329

Total                                43        1,966.2632      Difference = 3.0710

                                                               z = 1.5355

For n1 = 3, n2 = 40, the 1 per cent value of z, as derived
from Appendix Table VI, is .7308. The present value is
greatly in excess of this. The variation in X1 attributable
to the influence of X2, X3, and X4 is clearly greater than
the residual variability here used as the yardstick. The
measure of correlation, R, is unquestionably significant.
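The arithmetic of this test is brief enough to reproduce. The following sketch recomputes the mean squares and Fisher's z (one half the difference of the natural logarithms of the two mean squares) from the sums of squares quoted above. The variable names are illustrative assumptions, and the variance ratio shown at the end is not quoted in the text; it is included only because it is the equivalent quantity, z being one half its natural logarithm.

```python
import math

ss_regression, df_regression = 1214.9808, 3    # variation among computed values
ss_residual, df_residual = 751.2824, 40        # observed minus computed values

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# Fisher's z: half the difference between the natural logs of the mean squares.
z = 0.5 * (math.log(ms_regression) - math.log(ms_residual))

# Equivalent variance ratio (an addition for comparison, not given in the text).
variance_ratio = ms_regression / ms_residual

one_percent_z = 0.7308                         # tabled value for n1 = 3, n2 = 40
print(round(ms_regression, 4), round(ms_residual, 4), round(z, 4))
print(z > one_percent_z)
```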

COMPARISON OF MEASURES OF RELATIONSHIP

The degree to which our knowledge of the causes of
variation in corn yield has been improved and the reliability
of our estimates increased by taking account of the various
factors in combination may be more readily appreciated
if we bring together the various measures secured in the
course of this analysis.

Table 129

A Comparison of Certain Measures Pertaining to the Corn
Yield in Kansas

Basis of estimate                              Measure of         Coefficient
                                               reliability        of
                                               of estimate        correlation

Arithmetic mean of X1 = 19.63                  σ1     = 6.68
X1 = 100.35 - 1.096X2                          S12    = 5.80      r12    = - .4984
X1 = 166.07 - 1.866X3                          S13    = 4.81      r13    = - .6948
X1 = 119.45 - 1.288X4                          S14    = 5.78      r14    = - .5013
X1 = 222.99 - 0.460X2 - 1.420X3 - 0.749X4      S1.234 = 4.13      R1.234 =   .7861

The value of S might be further reduced and the value
of R correspondingly increased by bringing into the analysis
other factors, such as rainfall during the growing months.
The method which has been explained may be extended
to cover any number of variables, one equation being added
to the set of simultaneous equations for each additional
variable introduced.
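The single-variable rows of Table 129 can be checked against the variances and mean products computed earlier. The sketch below assumes the usual two-variable relations, r = p/(σ₁σⱼ) and S² = σ₁²(1 - r²); the function name `simple_measures` is an illustrative assumption.

```python
import math

# Variances of X1 ... X4 and mean products of X1 with each temperature series.
variance = {1: 44.6878, 2: 9.2385, 3: 6.2013, 4: 6.7723}
mean_product_with_x1 = {2: -10.1263, 3: -11.5671, 4: -8.7213}


def simple_measures(j):
    """Return (r_1j, S_1j) for the simple regression of X1 on Xj."""
    r = mean_product_with_x1[j] / math.sqrt(variance[1] * variance[j])
    s = math.sqrt(variance[1] * (1 - r * r))
    return r, s


for j in (2, 3, 4):
    r, s = simple_measures(j)
    print(j, round(r, 4), round(s, 2))
```

Each added variable can only lower, or leave unchanged, the standard error of estimate, which is why the last row of the table shows the smallest value.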


THE METHOD OF MULTIPLE CORRELATION VALID FOR LINEAR
RELATIONSHIPS

One important condition has not been emphasized in the 
course of the preceding discussion. The validity of this 
method of multiple correlation depends upon the existence 
of a linear relationship between each pair of variables. 
Thus with four variables there were six pairings possible 
(i.e., six mean products were computed). If there had been 
a material departure from linearity in any of these six 
relationships the significance of the results would have been 
decreased. There would be no fallacy involved in the use 
of the equation under these conditions, but it would not 
furnish as good a basis for estimates as one which took 
account of the true relationship. In such a case the values 
of S and R would indicate that the estimates based upon 
the assumption of linear relationship were not very reliable.¹

¹ An approach to problems of multiple correlation when the relationship
between the subordinate series is non-linear is explained by M. J. B. Ezekiel in
the Journal of the American Statistical Association, Vol. XIX, N. S. No. 148,
1924, and in his book Methods of Correlation Analysis.

AN APPLICATION OF THE METHOD

Let us illustrate the use of the estimating equation.
In the year 1933 the average June temperature in Kansas
was 80.5° F., the average July temperature was 81.4° F.,
and the average August temperature was 76.8° F. What
was the probable corn yield per acre? Substituting these
values for X2, X3, and X4 in the equation,

X1 = 222.99 - 0.460X2 - 1.420X3 - 0.749X4

we have

X1 = 222.99 - (0.460 × 80.5) - (1.420 × 81.4) - (0.749 × 76.8)

X1 = 12.85.

The estimated yield for 1933 is thus 12.85 bushels per acre.

What are the limits within which we may expect the
actual yield to fall, with respect to this estimate? The value
of S1.234 is 4.13 bushels. This means that the odds are
68 out of 100 that the actual yield will be within the
limits 8.72 bushels (i.e., 12.85 - 4.13) and 16.98 bushels
(i.e., 12.85 + 4.13). The actual yield in 1933 was 11.5
bushels per acre.

In this illustration we have used one of the years in- 
cluded in the study. The same method would be employed 
in making an estimate for a future year. (Additional ele- 
ments of uncertainty are introduced, of course, whenever 
results secured for one period are applied to another time 
period.) Thus, from the temperatures in 1936 (76.7° in 
June, 85.5° in July, and 84.4° in August), an estimate 
of 3.1 bushels per acre is yielded by the regression equation
employed above. This was a summer of exceptional heat 
and drought. The actual yield was 4.0 bushels per acre. 
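Both estimates can be reproduced directly from the regression equation. This is a minimal sketch; the function name `estimate_yield` and the variable names are illustrative assumptions, and the limits shown are simply the estimate plus and minus one standard error of estimate, as in the text.

```python
def estimate_yield(june, july, august):
    """Estimated corn yield (bushels per acre) from mean monthly temperatures in degrees F."""
    return 222.99 - 0.460 * june - 1.420 * july - 0.749 * august


standard_error = 4.13   # S 1.234, in bushels

estimate_1933 = estimate_yield(80.5, 81.4, 76.8)
lower_1933 = estimate_1933 - standard_error
upper_1933 = estimate_1933 + standard_error

estimate_1936 = estimate_yield(76.7, 85.5, 84.4)

print(round(estimate_1933, 2), round(lower_1933, 2), round(upper_1933, 2))
print(round(estimate_1936, 1))
```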

The Meaning of Partial or Net Correlation

In the preceding section we have sought to determine
the degree to which corn yield in Kansas is affected by the
temperature in June, July, and August, treating the three
independent variables in combination. Our aim has been
to measure their combined effect upon corn yield. There
is a related problem, which in many studies may be of
major importance. This is the determination of the relationship
between a dependent variable and a single independent
variable when all other factors included in the study are
held constant. Concretely, what would be the effect upon
corn yield of variations in July temperature, if June temperature
and August temperature could be held constant? This
is the problem of net or partial correlation.

It is obvious that if a method could be developed by
which two variables could be isolated for separate study, it
would add immeasurably to the analytical powers of the
economist, and of social scientists in general. It would give
to the student in these fields that power to eliminate irrelevant
influences and to concentrate his attention upon a single
factor which is possessed by the chemist, for example. In 
studying the effect of one element upon another the chemist 
seeks to eliminate all other elements, and the effectiveness of 
his analysis depends in large part upon the degree to which it 
is possible thus to isolate the object of immediate interest. 

It is not generally possible in economic analysis to 
eliminate all but one of the factors responsible for variations 
in a given series. The direct and indirect causes of a given 
economic phenomenon are too numerous and too complicated 
in their interaction for the economist ever to hope to emulate 
the chemist in reducing his problem to terms of but two 
variables. But, within certain limits, the statistician is 
able to employ the method of the physical scientist in 
holding constant certain factors while the effects of varia- 
tions in another are studied. The methods which make this 
possible are among the most powerful of the instruments 
which the student of the social sciences possesses.^ 

The method of partial correlation may be explained with 
reference to the problem of corn yield in Kansas. Our 
object is to determine the net correlation between corn 
yield and the temperature in each of the three months 
for which the average temperature is given. 

DISTINCTION BETWEEN PARTIAL AND SIMPLE CORRELATION

It is important to distinguish between this problem and
that faced in the ordinary measurement of relationship
between two variables. We have already secured, as a
description of the average relationship between corn yield
and July temperature, the equation

X1 = 166.07 - 1.866X3

with

S13 = 4.81

and

r13 = - .6948.

These measures describe the relationship in question when
all other factors are ignored. They are not taken account
of. They are merely neglected. It is as though the chemist,
in studying the reaction of one element to another, used
a test tube containing various impurities, which he made
no attempt to remove. The economist cannot, in general,
locate and remove all the "impurities" in his problem, but
he should recognize that his measures relate to such
uncorrected data.

The Method of Partial Correlation

In seeking to determine the net correlation between corn 
yield and July temperature we attempt to secure a measure 
of the correlation which would prevail if other factors might 
be held constant. We shall take full account of the other 
factors we have studied, but we shall try to secure a meas- 
ure influenced only by fluctuations in July temperature, in 
relation to corn yield. 

One possible method of accomplishing this end may be
suggested. If we possessed data covering a very long
period we might be able to pick out a number of years 
during which the average temperatures in June and August 
remained unchanged. Let us say that we could find thirty 
years in all, during each of which the June temperature 
averaged 74° and the August temperature 78°. Corn yield 
and July temperature varied during these years. The re- 
lationship between July temperature and corn yield might 
now be measured, and it would be certain that the results 
would not be affected by the presence of fluctuations in 
June temperature and August temperature. Unfortunately, 
this method of holding certain factors constant cannot be 
employed. The data are too limited and too varied, in
general, to enable us to pick from among them such figures
as are appropriate to our purpose. Other methods of
arriving at the same end are available, however.

As a first step, let us derive the equation defining the
relationship between corn yield as dependent variable and
June temperature and August temperature as independent
variables. This will be of the form

$$X_1 = a + b_{12.4}X_2 + b_{14.2}X_4.$$

We solve for the constants exactly as in the preceding
example, except that variables $X_1$, $X_2$, and $X_4$ only are
employed. The desired equation is

$$X_1 = 160.97 - 0.856X_2 - 1.010X_4.$$

We may determine the value of the standard error of
estimate from the relation

$$S^2_{1.24} = \sigma_1^2 - b_{12.4}p_{12} - b_{14.2}p_{14}.$$

We secure

$$\begin{aligned}
S^2_{1.24} &= 27.2112 \\
S_{1.24} &= 5.22.
\end{aligned}$$


If corn yield per acre is estimated from June temperature 
and August temperature the standard error of estimate, 
or the standard deviation of the remaining variability, is 
5.22 bushels. But we know that if corn yield is estimated 
from June, July, and August temperature, the standard 
error of estimate, or the standard deviation of the remaining 
variability, is 4.13 bushels. The measure of remaining or 
“unexplained” variability is reduced from 5.22 to 4.13 
by the addition of July temperature (X3) to the estimating 
equation, after account has already been taken of the 
influence of June temperature (X2) and August temperature 
(X4). The difference between these two measures may be 
taken to represent a relationship between Xi and X3 which 
is not affected by variations in X2 and X4. 

We have seen that the degree of correlation between a
dependent variable ($X_1$) and an independent variable ($X_3$)
may be defined by the relation

$$r^2_{13} = 1 - \frac{S^2_{13}}{\sigma_1^2}.$$

Here the denominator of the fraction in the right-hand
member defines the original variability of $X_1$, while the
numerator of that fraction defines the variability of $X_1$
after account has been taken of the influence of $X_3$. In
the present problem we have

$$r^2_{13} = 1 - \frac{23.1134}{44.6878} = .4828, \qquad r_{13} = -.6948.$$

The coefficient of correlation is given the sign of $b_{13}$, the
coefficient of regression.

In exactly the same way, we may say that the net
correlation between $X_1$ and $X_3$, when the relationship is
not affected by fluctuations in $X_2$ and $X_4$, is defined by the
relation

$$r^2_{13.24} = 1 - \frac{S^2_{1.234}}{S^2_{1.24}}.$$

Here the denominator of the fraction in the right-hand
member defines the variability remaining in $X_1$ after account
has been taken of the influence of $X_2$ and $X_4$, while the
numerator defines the variability remaining in $X_1$ after
account has been taken of the influence of $X_2$, $X_3$, and $X_4$.
Numerator and denominator differ only because of the presence
of correlation between $X_1$ and $X_3$ that is incremental to any
correlation that may exist between $X_1$ on the one hand and
$X_2$ and $X_4$ on the other. If the equation

$$X_1 = 222.99 - 0.460X_2 - 1.420X_3 - 0.749X_4$$

gives estimates no more reliable than those derived from
the equation

$$X_1 = 160.97 - 0.856X_2 - 1.010X_4$$

numerator ($S^2_{1.234}$) and denominator ($S^2_{1.24}$) of the above
fraction will be equal, and $r^2_{13.24}$ will have a value of zero.
But if the equation containing $X_2$, $X_3$, and $X_4$ as independent
variables gives better estimates than does the equation
containing only $X_2$ and $X_4$, the numerator will be smaller
than the denominator, and $r^2_{13.24}$ will have a value other
than zero. If the estimates based upon the three independent
variables are in exact agreement with observed yields,
$S^2_{1.234}$ will be equal to zero, and $r^2_{13.24}$ will have a value of
unity.

Employing the values derived above, we have

$$\begin{aligned}
r^2_{13.24} &= 1 - \frac{17.0746}{27.2112} \\
&= .3725 \\
r_{13.24} &= -.610.
\end{aligned}$$

The coefficient of net correlation, $r_{13.24}$, is negative, having
the same sign as the coefficient of net regression, $b_{13.24}$.

The quantity $r_{13.24}$ measures the degree of correlation
between $X_1$ and $X_3$ when neither one is affected by variations
in $X_2$ and $X_4$. It may be thought of, equally well, as a
measure of the degree to which errors in estimating $X_1$
are reduced when use is made of $X_3$, after full account has
already been taken of the influence of $X_2$ and $X_4$ on $X_1$.
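The comparison of the two standard errors can be written as a short function. This is a sketch under the relation given above, with the sign of the coefficient taken from the corresponding coefficient of net regression; the function name `net_correlation` is an illustrative assumption.

```python
import math


def net_correlation(s2_full, s2_reduced, net_regression_coefficient):
    """Net correlation from the residual variances with and without one variable.

    s2_full:    residual variance with every independent variable included
    s2_reduced: residual variance with the variable of interest left out
    The result carries the sign of the matching coefficient of net regression.
    """
    r_squared = 1 - s2_full / s2_reduced
    return math.copysign(math.sqrt(r_squared), net_regression_coefficient)


# S^2 1.234 = 17.0746, S^2 1.24 = 27.2112, b13.24 = -1.41967
r13_24 = net_correlation(17.0746, 27.2112, -1.41967)
print(round(r13_24 ** 2, 4), round(r13_24, 3))
```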

The meaning of the symbols employed in the above
demonstration should be clear from the context. As in
the coefficients of net regression, the first of the subscripts
to the left of the point (the primary subscripts) refers to
the dependent variable; the second of the primary subscripts
refers to the single independent variable to which
the measure of net correlation applies specifically. The
subscripts to the right of the point (the secondary subscripts)
indicate the variables which are held constant for
the purpose of the particular comparison being made. The
number held constant is two in the present case, though
it might be one, or any other number. Thus the general
formula for the coefficient of net correlation between variables
$X_1$ and $X_2$ would be

$$r^2_{12.345 \ldots n} = 1 - \frac{S^2_{1.2345 \ldots n}}{S^2_{1.345 \ldots n}}.$$

The variable that is present in the numerator and absent
in the denominator is the particular independent variable
that is being paired with the dependent variable for the
purpose of measuring net relationship.

The coefficients of net correlation between $X_1$ and each
of the other independent variables may be derived in similar
fashion. Thus

$$r^2_{12.34} = 1 - \frac{S^2_{1.234}}{S^2_{1.34}}$$

$$r^2_{14.23} = 1 - \frac{S^2_{1.234}}{S^2_{1.23}}.$$

In each case the difference between numerator and denominator
of the fraction in the right-hand member measures
the net reduction in the variability of $X_1$ which is associated
with a relationship between $X_1$ and a single independent
variable, account having already been taken of the influence
of two other variables.

It is clear that such measurements as these are net only
with respect to the variables represented by the secondary
subscripts. The coefficient $r_{13.24}$ measures the degree of
relation between $X_1$ and $X_3$ when $X_2$ and $X_4$ are held constant.
There may be many other factors affecting $X_1$ and
$X_3$; the disturbing influences of such factors have not been
eliminated. These other factors still muddy the water of
analysis. Ignoring them is not the same as holding them
constant. Only by direct measurement and inclusion in
the study, as was done with $X_2$ and $X_4$, may the influence
of additional variables be effectively eliminated.

Another Method of Computing Coefficients of 
Partial Correlation 

Obviously a whole series of coefficients of net correlation
may be computed in dealing with a number of variables.
In deriving a number of such measurements a method may
be utilized which differs somewhat from that employed
above, and which has certain advantages in the way of
systematic arrangement.

A simple coefficient of correlation relating to but two
variables is termed a coefficient of zero order. Such coefficients
are represented by symbols of the type $r_{12}$, $r_{13}$, etc.
Coefficients of net correlation which relate to two variables,
while a single additional variable is held constant, are
termed coefficients of the first order, and are represented by
symbols such as $r_{12.3}$, $r_{24.3}$, etc. Similarly, we may have
coefficients of the second, third, fourth, or nth order, depending
upon the number of variables held constant while the
relationship between a single dependent and a single independent
variable is being measured.

It is possible to derive each coefficient of partial correlation
from those of the next lower order. Thus a coefficient
of the first order may be derived from the relation

$$r_{12.3} = \frac{r_{12} - r_{13} \cdot r_{23}}{(1 - r^2_{13})^{\frac{1}{2}} (1 - r^2_{23})^{\frac{1}{2}}}.$$

For a coefficient of the second order

$$r_{12.34} = \frac{r_{12.3} - r_{14.3} \cdot r_{24.3}}{(1 - r^2_{14.3})^{\frac{1}{2}} (1 - r^2_{24.3})^{\frac{1}{2}}}.$$

As a general equation for a coefficient of net correlation
of any order,¹ we have

$$r_{12.345 \ldots n} = \frac{r_{12.345 \ldots (n-1)} - r_{1n.345 \ldots (n-1)} \cdot r_{2n.345 \ldots (n-1)}}{(1 - r^2_{1n.345 \ldots (n-1)})^{\frac{1}{2}} (1 - r^2_{2n.345 \ldots (n-1)})^{\frac{1}{2}}}.$$

Thus it is possible, starting with the zero order coefficients
of correlation, to compute all higher order coefficients
successively. The mere arithmetic of calculation would be
laborious, but certain prepared tables reduce these computations
to a minimum. The method may be illustrated, using
the data of the preceding problem.

In the present case we require three coefficients of the
second order, $r_{12.34}$, $r_{13.24}$, and $r_{14.23}$. These will serve as
measures of the net correlation between corn yield and
temperature in each of the three critical months. The
formula from which the first of these measures may be
computed was given above. For the second, we have

$$r_{13.24} = \frac{r_{13.2} - r_{14.2} \cdot r_{34.2}}{(1 - r^2_{14.2})^{\frac{1}{2}} (1 - r^2_{34.2})^{\frac{1}{2}}}$$

and for the third

$$r_{14.23} = \frac{r_{14.2} - r_{13.2} \cdot r_{43.2}}{(1 - r^2_{13.2})^{\frac{1}{2}} (1 - r^2_{43.2})^{\frac{1}{2}}}.$$

But each of these values may be derived from a slightly
different grouping of first order coefficients. We may use
the three formulas

$$r_{12.34} = \frac{r_{12.4} - r_{13.4} \cdot r_{23.4}}{(1 - r^2_{13.4})^{\frac{1}{2}} (1 - r^2_{23.4})^{\frac{1}{2}}}$$

$$r_{13.24} = \frac{r_{13.4} - r_{12.4} \cdot r_{32.4}}{(1 - r^2_{12.4})^{\frac{1}{2}} (1 - r^2_{32.4})^{\frac{1}{2}}}$$

$$r_{14.23} = \frac{r_{14.3} - r_{12.3} \cdot r_{42.3}}{(1 - r^2_{12.3})^{\frac{1}{2}} (1 - r^2_{42.3})^{\frac{1}{2}}}.$$

By employing both methods in computing each second
order coefficient a check upon the calculations is afforded.

¹ It will be noted that in an equation used in computing a coefficient of
partial correlation the three r's in the numerator of the right-hand member
have the same secondary subscripts, and that these secondary subscripts are
one less in number than the secondary subscripts of the left-hand member;
that the first r in the numerator has the same primary subscripts as the left-hand
member; that the second and third r's in the numerator have primary
subscripts composed of one of the primary subscripts of the left-hand member
plus the missing secondary subscript; that the two r's in the denominator are
the same as the second and third r's in the numerator.
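The general equation lends itself to a recursive statement: a coefficient of any order is built from three coefficients of the next lower order, down to the zero order values. The sketch below is one way of writing this, not the tabular routine of the text. The function name `partial_r` and the dictionary `ZERO_ORDER` are illustrative assumptions; the six zero order values are those used for the Kansas data.

```python
import math

# Zero order coefficients for corn yield (1) and June, July, August temperature (2, 3, 4).
ZERO_ORDER = {
    frozenset((1, 2)): -0.4984,
    frozenset((1, 3)): -0.6948,
    frozenset((1, 4)): -0.5013,
    frozenset((2, 3)): 0.3938,
    frozenset((2, 4)): 0.2775,
    frozenset((3, 4)): 0.2868,
}


def partial_r(i, j, held=()):
    """Coefficient of net correlation between variables i and j, holding `held` constant."""
    if not held:
        return ZERO_ORDER[frozenset((i, j))]
    # Remove the last secondary subscript k and work from the next lower order.
    *rest, k = held
    rest = tuple(rest)
    r_ij = partial_r(i, j, rest)
    r_ik = partial_r(i, k, rest)
    r_jk = partial_r(j, k, rest)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))


print(round(partial_r(1, 2, (3, 4)), 4))
print(round(partial_r(1, 3, (2, 4)), 4))
print(round(partial_r(1, 4, (2, 3)), 4))
```

Listing the held variables in a different order, for example `(4, 3)` in place of `(3, 4)`, corresponds to the alternative grouping of first order coefficients and should give the same result apart from rounding.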


COMPUTATION OF FIRST ORDER COEFFICIENTS

The second order coefficients cannot be computed until
all necessary first order coefficients have been secured.
The necessary equations, of the type

$$r_{12.3} = \frac{r_{12} - r_{13} \cdot r_{23}}{(1 - r^2_{13})^{\frac{1}{2}} (1 - r^2_{23})^{\frac{1}{2}}},$$

derived from the general formula for coefficients
of net correlation, may be employed.

Table 130

Illustrating the Computation of First Order Coefficients of Partial
Correlation

(Kansas corn yield and temperature)

   r, zero order                  Product                            r, first order
Sub-     Coef-                    term of     Whole      Denom-     Sub-     Coef-
script   ficient    (1 - r²)^½    numerator   numerator  inator     script   ficient

12       - .4984                  - .2736     - .2248    .6611      12.3     - .3400
13       - .6948    .7192
23       + .3938    .9192

14       - .5013                  - .1993     - .3020    .6890      14.3     - .4383
13       - .6948    .7192
43       + .2868    .9580

24       + .2775                  + .1129     + .1646    .8806      24.3     + .1869
23       + .3938    .9192
43       + .2868    .9580

13       - .6948                  - .1963     - .4985    .7969      13.2     - .6255
12       - .4984    .8669
32       + .3938    .9192

14       - .5013                  - .1383     - .3630    .8329      14.2     - .4358
12       - .4984    .8670
42       + .2775    .9607

34       + .2868                  + .1093     + .1775    .8831      34.2     + .2010
32       + .3938    .9192
42       + .2775    .9607

12       - .4984                  - .1391     - .3593    .8313      12.4     - .4322
14       - .5013    .8653
24       + .2775    .9607

13       - .6948                  - .1438     - .5510    .8290      13.4     - .6647
14       - .5013    .8653
34       + .2868    .9580

23       + .3938                  + .0796     + .3142    .9204      23.4     + .3414
24       + .2775    .9607
34       + .2868    .9580

The procedure in computing each first order coefficient
is simple. Three zero order coefficients are necessary for
each calculation. These should be arranged in the table
in the order in which they occur in the numerator of the
fraction from which the required coefficient is to be computed.
The numerator of this fraction is secured by subtracting
from the first zero order coefficient the product of
the other two. This product term appears in one column
of the table. The denominator of the fraction is the product
of two terms of the type $\sqrt{1 - r^2}$, derived from the second
and third coefficients in each group of three. The tabular
arrangement of Table 130 permits these computations
to be carried forward systematically.

The coefficient $r_{23.4}$ is, of course, identical with $r_{32.4}$,
$r_{34.2}$ is identical with $r_{43.2}$, etc. It is unnecessary to duplicate
the work of computation with respect to these measures.

COMPUTATION OF SECOND ORDER COEFFICIENTS

From these first order coefficients the three required
second order coefficients may be secured by methods analogous
to those employed above. The computations are
shown in Table 131. As a check upon the calculations each
required measure is computed from two different combinations
of the first order coefficients.
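
The check described above can be sketched as follows: the same formula is applied one order higher, and $r_{12.34}$ is reached by two routes. The helper name `partial` is an assumption made for illustration; the first order values are those entered in Table 131.

```python
import math

def partial(r_ab: float, r_ac: float, r_bc: float) -> float:
    """Remove one further variable c from the correlation of a and b."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

# Route A: start from coefficients holding variable 3 constant, remove 4.
# Inputs are r12.3, r14.3, r24.3.
r12_34_a = partial(-0.3400, -0.4383, 0.1869)

# Route B: start from coefficients holding variable 4 constant, remove 3.
# Inputs are r12.4, r13.4, r23.4.
r12_34_b = partial(-0.4322, -0.6647, 0.3414)

print(f"route A: {r12_34_a:.4f}   route B: {r12_34_b:.4f}")
```

The two routes differ only in the fourth decimal place, which is the agreement the check is meant to confirm.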

The value of $r_{13.24}$, it will be noted, is the same as that
derived from the relation between $S_{1.24}$ and $S_{1.234}$.

The interpretation of such coefficients as these was explained
in the earlier section dealing with this problem. The following
summary of results reveals the gain in knowledge which
has resulted from the above analysis.


$r_{12} = -.4984$          $r_{12.34} = -.2923$
$r_{13} = -.6948$          $r_{13.24} = -.6101$
$r_{14} = -.5013$          $r_{14.23} = -.4057$


It is clear that the net effect of June temperature upon
corn yield is distinctly less than was indicated by the simple
correlation. This is so because there is a positive correlation
between temperature in June and temperature in July and
August, so that the crude correlation of two variables
alone shows June temperature as more important than it
really is. For the same reason, all the net coefficients are
less than the simple coefficients, though it is still apparent
that July temperature is far more important, in relation
to corn yield, than the temperature in either of the other
months.

Table 131

Illustrating the Computation of Second Order Coefficients of Partial
Correlation

(Kansas corn yield and temperature)

  1st order                          Product                            2nd order
Sub-     Coef-      (1 - r^2)^1/2    term of      Whole      Denom-    Sub-     Coef-
script   ficient                     numerator    numerator  inator    script   ficient

12.3    - .3400                      - .0819      - .2581    .8830     12.34   - .2923
14.3    - .4383        .8988
24.3    + .1869        .9824

13.2    - .6255                      - .0876      - .5379    .8816     13.24   - .6101
14.2    - .4358        .9000
34.2    + .2010        .9796

14.2    - .4358                      - .1257      - .3101    .7643     14.23   - .4057
13.2    - .6255        .7802
43.2    + .2010        .9796

12.4    - .4322                      - .2269      - .2053    .7022     12.34   - .2924
13.4    - .6647        .7471
23.4    + .3414        .9399

13.4    - .6647                      - .1476      - .5171    .8476     13.24   - .6101
12.4    - .4322        .9018
32.4    + .3414        .9399

14.3    - .4383                      - .0635      - .3748    .9238     14.23   - .4057
12.3    - .3400        .9404
42.3    + .1869        .9824

The coefficients of net correlation are net, of course,
only with respect to the variables actually taken account
of, and held constant. Thus there may be other factors,
such as rainfall in June, July, or August, which affect
corn yield and which are correlated with the temperature
during these months. Were these included the various
coefficients of net correlation might have different values
from those given.

The sampling error of a coefficient of partial correlation
may be estimated from the same general relations that
hold for zero order coefficients, except that the factor $N - 1$
must be further reduced by the number of variables held
constant. Thus for $r_{12.34}$ we have


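A MEASURE OF VARIABILITY

Having these coefficients of net correlation, another
measure of some importance may be computed. This is
a measure of the variability of a single character when a
number of other factors are held constant. Thus the
question might be asked: If there were no variation in
temperature in Kansas in June, July, and August, what would
be the variability of corn yield? In other words, if we
could eliminate such variability in corn yield as is due to
variations in temperature during these three months, what
variability would remain? The measure required is the standard
deviation of corn yield when these three factors are held
constant, $\sigma_{1.234}$.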

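This measure may be computed from the general equation

$$\sigma^2_{1.23 \ldots n} = \sigma^2_1 (1 - r^2_{12})(1 - r^2_{13.2})(1 - r^2_{14.23}) \cdots (1 - r^2_{1n.23 \ldots (n-1)}).$$

For the present example,

$$\sigma^2_{1.234} = 44.6878[1 - (-.4984)^2][1 - (-.6255)^2][1 - (-.4057)^2]$$

$$\sigma^2_{1.234} = 17.0797$$

$$\sigma_{1.234} = 4.13.$$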
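
The arithmetic of this reduction may be sketched in Python. The variable names are assumptions chosen for illustration; the variance 44.6878 and the three coefficients are those used in the computation above.

```python
import math

variance_x1 = 44.6878  # variance of corn yield, sigma_1 squared

# Coefficients of successively higher order: r12, r13.2, r14.23
partials = [-0.4984, -0.6255, -0.4057]

# Each factor (1 - r^2) removes the share of variability associated
# with one more temperature series.
residual_variance = variance_x1
for r in partials:
    residual_variance *= 1 - r ** 2

sigma_1_234 = math.sqrt(residual_variance)
print(f"variance = {residual_variance:.4f}, sigma_1.234 = {sigma_1_234:.2f}")
```

Carried out with full decimal precision the product differs from the printed 17.0797 only in the third decimal place, a difference attributable to rounding of intermediate terms.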

Referring back to the discussion of this problem we find
that the values of $\sigma_{1.234}$ and $S_{1.234}$ are identical. That is,
the standard deviation of variable $X_1$, when variables $X_2$,
$X_3$, and $X_4$ are held constant, is merely the standard deviation
of observed values from computed values of $X_1$. It
is the standard error of estimate, when estimates are based
upon the factors $X_2$, $X_3$, and $X_4$. The reason for this is
obvious. The variability of the original series is reduced
to the extent that estimates based upon the equation of
relationship approximate the actual values. The variability
which remains is due to differences between these estimates
and the actual values. But these differences are merely
the residual deviations, from which S is computed. A realization
of the identity of these two measures may assist
in making their meaning clear.

Since $\sigma_{1.234}$ and $S_{1.234}$ are identical, the coefficient of
multiple correlation, $R_{1.234}$, may be computed from the
equation

$$R^2_{1.23 \ldots n} = 1 - \frac{\sigma^2_{1.23 \ldots n}}{\sigma^2_1}$$

or, using the formula for $\sigma^2_{1.23 \ldots n}$, from the equation

$$1 - R^2_{1.23 \ldots n} = (1 - r^2_{12})(1 - r^2_{13.2})(1 - r^2_{14.23}) \cdots (1 - r^2_{1n.23 \ldots (n-1)}).$$
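
As a rough illustration, the second form of the equation can be evaluated directly from the three coefficients of the corn yield example. The names below are assumptions for the sketch, and the resulting value of R is a computed illustration rather than a figure quoted in the text.

```python
import math

# r12, r13.2, r14.23 from the corn yield example
partials = [-0.4984, -0.6255, -0.4057]

# 1 - R^2 is the product of the (1 - r^2) factors
unexplained_share = math.prod(1 - r ** 2 for r in partials)

r_squared = 1 - unexplained_share
multiple_r = math.sqrt(r_squared)
print(f"1 - R^2 = {unexplained_share:.4f}, R = {multiple_r:.3f}")
```

The same `unexplained_share`, multiplied by the variance of $X_1$, gives the residual variance, which is why the two forms of the equation are interchangeable.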

BETA COEFFICIENTS 

The several coefficients of regression in an equation of 
multiple regression are, in effect, weights applied to the 
different independent variables in estimating the successive 
values of the dependent variable. Usually these coefficients 
of regression are not comparable, because the independent 
factors are expressed in different units, or because they 
differ in variability. It is often desirable to reduce the
coefficients of regression to comparable terms. This may
be done by expressing dependent and independent variables
alike in units of their respective standard deviations. The
coefficients of regression are then called Beta coefficients
and are represented by the symbols $\beta_{12.34}$, $\beta_{13.24}$, etc.
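In terms of a simple two-variable problem, we have

$$x_1 = b_{13} x_3.$$

If we change to standard deviation units we must divide
$x_1$ by $\sigma_1$ and $x_3$ by $\sigma_3$. This gives

$$\frac{x_1}{\sigma_1} = b_{13}\left(\frac{\sigma_3}{\sigma_1}\right)\frac{x_3}{\sigma_3}$$

or

$$\frac{x_1}{\sigma_1} = \beta_{13}\,\frac{x_3}{\sigma_3}.$$

The desired Beta coefficient is, then,

$$\beta_{13} = b_{13}\left(\frac{\sigma_3}{\sigma_1}\right).$$

For the corn yield example, we have

$$\beta_{13} = b_{13}\left(\frac{\sigma_3}{\sigma_1}\right) = -.696.$$

This may be taken to mean that with an increase of one
standard deviation in $X_3$ (July temperature), the yield of
corn decreased .696 of one standard deviation.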

These measurements are particularly useful in analyses
involving more than two variables. Here the relationships
between the beta coefficients and the coefficients of net
regression are similar to those indicated for the two-variable
problem. Thus

$$\beta_{12.34} = b_{12.34}\left(\frac{\sigma_2}{\sigma_1}\right)$$

$$\beta_{13.24} = b_{13.24}\left(\frac{\sigma_3}{\sigma_1}\right)$$

$$\beta_{14.23} = b_{14.23}\left(\frac{\sigma_4}{\sigma_1}\right).$$

Substituting the required values in these equations, we have

$\beta_{12.34} = -.209$
$\beta_{13.24} = -.529$
$\beta_{14.23} = -.292$.

The second of these coefficients may be taken to mean that
with an increase of one standard deviation in July temperature,
when June and August temperatures are held constant,
corn yield decreases .529 of one standard deviation. The
other coefficients have similar meanings.

The beta coefficients relate to factors expressed in comparable
units and similar in respect of variability. A
fluctuation of one standard deviation in $X_2$ may be taken
to be equal to a fluctuation of one standard deviation in $X_3$.
The coefficients defining the changes in $X_1$ that are likely
to accompany these equal movements in $X_2$ and $X_3$ have
obvious significance.
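
The conversion from a coefficient of regression to a beta coefficient is a single rescaling, sketched below. The function name and the numerical inputs are hypothetical, chosen only to show the arithmetic; they are not the regression coefficients or standard deviations of the corn yield example.

```python
def beta_coefficient(b: float, sigma_independent: float,
                     sigma_dependent: float) -> float:
    """Rescale a regression coefficient to standard deviation units.

    beta = b * (sigma of the independent variable / sigma of the
    dependent variable).
    """
    return b * sigma_independent / sigma_dependent

# Hypothetical values for illustration only:
# b = -0.5 units of yield per degree, sigma of temperature = 3.0 degrees,
# sigma of yield = 6.0 units.
beta = beta_coefficient(b=-0.5, sigma_independent=3.0, sigma_dependent=6.0)
print(f"beta = {beta:.3f}")
```

With these assumed figures a rise of one standard deviation in temperature would accompany a fall of one quarter of a standard deviation in yield.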

CERTAIN LIMITATIONS 

The measures we have described in dealing with problems 
of multiple and partial correlation are valid on the assump- 
tion that the relationships among the different variables 
are in all cases linear. Thus with four variables six different 
pairs may be obtained. The regression in each of these 
six cases should be linear if combined or net effects are to 
be studied by the methods outlined above. If the regression 
is non-linear when natural numbers are dealt with, it may 
be possible to secure linear relationships by correlating 
logarithms or reciprocals. Thus we might derive an esti- 
mating equation of the type 

$$\log X_1 = a + b_{12.34} X_2 + b_{13.24} X_3 + b_{14.23} X_4$$

if the relation between $X_1$ in logarithmic form and each
of the other variables in the original arithmetic form were
linear. The corresponding measures, S and R, would then
relate to logarithms rather than to natural numbers, as in
the examples given in the following chapter.

One other important limitation should be noted. Coefficients
of multiple or of net correlation based upon a large
number of variables have little significance unless the number
of observations be large. Misleadingly high values will
be secured when studies involving many variables are based
upon small samples. (Application of the corrections referred
to in the text will prevent misinterpretation in such cases.)
Within the limits set by these restrictions, the methods of
multiple and partial correlation constitute very powerful
instruments of economic analysis.

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"f a method of 


’ tory work in the study of relations m ^ adapted to explora- 

to which they may lejid are limited bv the fnirH f’*'® ®”"'*>us‘ons 

/ oda of defining regresdon o^r^ n»eth- 

analytical work. For a diaouas^on of the^meti 7^^®“ *7** ®*®“ents into the 
CHAPTER XVII


THE MEASUREMENT OF RELATIONSHIP AND 
THE PROBLEM OF ESTIMATION 

It is no great exaggeration to say that quantitative
method in economics and business centers about the problem
of estimation. Equations of regression, measures of
standard error and coefficients of correlation are of interest
largely because of their bearing upon the practical problems
of determining probable production, probable price, and
similar quantities. It must not be understood from this
that the problem of estimation relates only to attempts to
forecast future change. We make an estimate whenever
we seek to determine the most probable value from a
number of different observations, or whenever we employ
an equation which describes the relation between two or
more variables. The value of statistical technique rests
largely upon the aid it gives in the making of estimates.

This object has been definitely to the fore in the preceding
chapters, which dealt with methods by which the value
of one variable might be estimated from a given value
of another. We may, at this point, briefly enumerate
certain assumptions upon which the validity of this method
rests.

Some Assumptions Involved in the Making of
Estimates

It is assumed that the most probable value of a number
of observations is their arithmetic mean. Given a normal
distribution about the mean, the standard deviation affords
an exact measure of the probabilities involved in basing
estimates upon the mean.
Similarly, the standard error of estimate affords an exact 
measure of the probabilities involved in basing estimates 
upon an equation of regression, again upon the assumption 
that the distribution about the line of regression is normal. 
The significance and usefulness of the equation of regression 
may be determined by comparing the standard error of 
estimate of a given variable with the standard deviation. 

From the relation between these two values, moreover, 
an abstract measure of relationship, the coefficient or index 
of correlation, may be computed. This coefficient, or index, 
is a thoroughly valid and accurate measure only if the 
distribution about the line of regression and the distribution 
about the mean are normal, or approximately so. Pro- 
nounced departures from the normal type lessen the signifi- 
cance of these measures. 

In the foregoing discussion we have been concerned with 
arithmetic values throughout. In speaking of estimates
based upon the mean we referred to the arithmetic mean. 
The distributions about the mean and about the line of 
regression are assumed to be normal when deviations are 
measured arithmetically. The standard deviation and the 
standard error of estimate are in arithmetic terms, referring 
to absolute values. But may we assume that all the 
distributions we deal with in economic analysis are of the 
arithmetic type? Should estimates be made and errors 
of estimate measured only in arithmetic terms? If they 
should not be so limited, are the methods developed above 
capable of adaptation to other distributions? These ques- 
tions may best be answered in terms of a specific problem. 

A Problem of Estimation: Logarithmic and Ratio
Values

In Table 132 the production and price of oats in the
United States from 1881 to 1913 are recorded. Appropriate
lines of trend were fitted to these series and the ratios
of the actual values of the items in each series to the trend
values determined.

Table 132

Production and Price of Oats in the United States

        Production    Straight      Ratio of      Price of     Straight    Ratio of
        of oats       line trend    actual pro-   oats in      line        actual
Year    (millions     of produc-    duction to    Chicago      trend of    price to
        of bu.)       tion          trend value   (cents       price       trend value
                                                  per bu.)

1881       416           448           .929          47          36.0        1.30
1882       488           471          1.036          37          35.3        1.05
1883       571           494          1.156          31          34.6         .90
1884       583           517          1.128          29          34.0         .85
1885       629           540          1.165          28          33.2         .84
1886       624           563          1.108          25          32.5         .77
1887       659           586          1.124          30          31.2         .96
1888       701           609          1.151          24          30.5         .79
1889       751           632          1.188          24          29.8         .81
1890       523           655           .798          43          29.0        1.48
1891       738           678          1.088          31          28.3        1.10
1892       661           701           .943          30          27.5        1.09
1893       639           724           .882          31          26.8        1.16
1894       662           747           .886          28          26.1        1.07
1895       824           770          1.070          19          25.3         .75
1896       780           793           .983          18          23.6         .76
1897       791           816           .969          24          25.0         .96
1898       843           839          1.005          25          26.4         .95
1899       926           862          1.074          23          27.8         .83
1900       914           885          1.033          25          29.2         .86
1901       778           908           .857          42          30.6        1.37
1902     1,053           931          1.131          33          32.0        1.03
1903       869           954           .911          38          33.4        1.14
1904     1,009           977          1.033          30          34.8         .86
1905     1,090         1,000          1.090          31          36.2         .86
1906     1,036         1,023          1.013          39          37.6        1.04
1907       805         1,046           .770          51          39.0        1.31
1908       851         1,069           .796          52          40.4        1.29
1909     1,068         1,092           .978          43          41.8        1.03
1910     1,186         1,115          1.064          35          43.2         .81
1911       922         1,138           .810          51          44.6        1.14
1912     1,418         1,161          1.221          37          46.0         .80
1913     1,122         1,184           .948          41          47.4         .87

It is desired to measure the relation between these two
variables. A hyperbolic curve of the general type $Y = aX^b$
appears to be an appropriate form to employ in describing
such a relationship. To fit this curve by the method of
least squares, the equation must be reduced to the logarithmic
form

$$\log Y = \log a + b \log X.$$

The normal equations required in fitting a curve of this
type are

I    $\Sigma(\log Y) = N \log a + b\,\Sigma(\log X)$

II   $\Sigma(\log X \cdot \log Y) = \log a\,\Sigma(\log X) + b\,\Sigma(\log^2 X)$.

The values necessary for the solution of these equations
are determined from Table 133.¹

From this table we have

$N = 33$
$\Sigma(\log Y) = -.32849$
$\Sigma(\log X) = .037535$
$\Sigma(\log X \cdot \log Y) = -.1143005$
$\Sigma(\log^2 X) = .096423$.

Substituting in the normal equations, we secure

$-.32849 = 33 \log a + .037535b$
$-.1143005 = .037535 \log a + .096423b$.

Solving

$\log a = -.00861$
$b = -1.18206$.

The required equation is

$$\log Y = (9.99139 - 10) - 1.18206 \log X$$

or

$$Y = .9804X^{-1.18206}.$$
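
The solution of the two normal equations may be verified with a short sketch. The variable names are assumptions; the five sums are those entering the normal equations above, and the pair of equations is solved by determinants.

```python
# Sums entering the normal equations for the logarithmic curve
n = 33
sum_log_y = -0.32849
sum_log_x = 0.037535
sum_log_x_log_y = -0.1143005
sum_log_x_sq = 0.096423

# Normal equations:
#   sum_log_y       = n * log_a         + b * sum_log_x
#   sum_log_x_log_y = log_a * sum_log_x + b * sum_log_x_sq
determinant = n * sum_log_x_sq - sum_log_x ** 2
log_a = (sum_log_y * sum_log_x_sq - sum_log_x * sum_log_x_log_y) / determinant
b = (n * sum_log_x_log_y - sum_log_x * sum_log_y) / determinant

a = 10 ** log_a  # anti-logarithm (common logarithms are used throughout)
print(f"log a = {log_a:.5f}, b = {b:.5f}, a = {a:.4f}")
```

Because common logarithms are used, the constant is recovered as a power of 10 rather than of e.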

¹ I am indebted to Prof. H. B. Killough of Brown University for permission
to use the data presented in Tables 132 and 133. The figures are taken from his
comprehensive study of the factors affecting oat prices.

This is the equation which describes the average relationship
between the production and the price of oats
(when the actual figures for each are expressed as ratios
to the respective lines of trend). The corresponding curve
is plotted in Fig. 88 on page 592.

Table 133 

Computation of Values Required in Fitting a Curve to Data of Oat 
J^Toductwn (ZTid Pvices 
Example I 


(3) 

Batio 
of pro* 
Auction to 
trend 
X 



(4) 


logy 
.1139434 
.0211893 
.9642425 
.9294189 
.9242793 
.8864907 
.9822712 
.8976271 
.9084850 
. 1702617 
.0413927 
.0374285 
.0644580 
.0293838 
.8750613 - 
.8808136 - 
.9822712 - 
.9777236 - 
.0190781 - 
.934498.5 - 
. 1367206 
.0128372 
.0569049 
.9344985 - 
.9344985 - 
.0170333 
.1172713 
.1105897 
.0128372 
. 9084860 - 
.0,569049 
. 9030900 “ 
.9395193 


(5) 


% X 

.9680167 - 
.0163698 
.0629678 
.0523091 
.0663259 
.044.5398 
.0507663 
.0610753 
.0748164 
.9020029 
.0366289 
.9745117 
.9464686 
.9474337 
.0293838 
.9925636 
.9863238 
.0021661 
.0310043 
.0141003 
.9.329808 - 
.0534626 
.9695184 - 
.0141003 
.0374265 
.0056094 
.8864907 - 
.9009131 - 
.9903389 - 
.0269416 
. 9084850 - 
.0867157 
.9768083 


- 1 

- 1 
~ 1 
- 1 , 

- 1 
~ 1 


( 6 ) 


log^ Y 
.01298310 
.00044899 
.00209376 
.00498169 
.00573362 
.01288436 
.00031431 
.01048021 
.00837600 
.02898905 
.00171336 
. 00140074 
.00415483 
.00086341 
.01560968 
.01420540 
.00031431 
.00049624 
.00654836 
,00429045 
.01726316 
.00016479 
.00323817 
.00429046 
.00429046 
.00029013 
.01375256 
.01223008 
.00016479 
.00837500 
.00323817 
.00939165 
00366792 


(7) 


17.67l5068irig 14.0375350:^4 .Illlfil 


log^X 
.001022995 
.000235923 
•003963685 
.002736242 
.004399126 
.001983794 
.002,577217 
.003730192 
.005597494 
.009803432 
.001341676 
.000649653 
.002973674 
.002763216 
.000863408 
.000066450 
•000187038 
.000004692 . 
.000961267 ■ 
.000198818 • 
.004491573 - 
.002868250 
.001638760 - 
.000198818 - 
.001400743 - 
.000031465 
.012884361 - 
.009818214 - 
.000093337 - 
.000725850 - 
.008374812 - 
.007519613 - 
. 000537 855 
.096422642 


(8) 


^og F * 
.0036444 
.0003255 
- .0028808 


. 0060222 
0050557 
.0009000 
.0062624 
.0068488 
•0166852 
.0016162 
.0009639 
.00351,60 
.0015446 
•0036712 
.0008875 
0002425 
0000483 
0026089 
0009236 
0091629 
0006863 
0023036 
0009236 
.0024515 
.0000956 
■0133113 
.0109680 
0001240 
■ 0024656 
.0052070 
. 0084036 
0014027 
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■f .0051562 
■ ■ ~ .1143005 

THE STANDARD ERROR OF ESTIMATE IN LOGARITHMIC TERMS

With what confidence may estimates be based upon this
equation? The answer is given by the standard error of
estimate. Since the fitting process was carried through in
logarithms, the standard error may be computed in the
same terms. Following the procedure explained in earlier
sections with reference to the straight line and the potential
series, we may derive the following equation relating to
the logarithmic curve just fitted:

$$S^2_{\log y} = \frac{\Sigma(\log^2 Y) - \log a\,\Sigma(\log Y) - b\,\Sigma(\log X \cdot \log Y)}{N}$$

Substituting the proper values, we have

$$S^2_{\log y} = \frac{.21721807 - (-.00861 \times -.32849) - (-1.18206 \times -.1143005)}{33} = \frac{.07927928}{33}$$

$$S^2_{\log y} = .0024024$$

$$S_{\log y} = .04901.$$
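
A brief sketch of this computation follows. The variable names are assumptions; the sums and constants are those substituted above.

```python
import math

n = 33
sum_log_y_sq = 0.21721807       # sum of squared logarithms of Y
sum_log_y = -0.32849
sum_log_x_log_y = -0.1143005
log_a = -0.00861
b = -1.18206

# Mean squared deviation of log Y about the fitted line,
# obtained without computing the individual residuals.
s_squared = (sum_log_y_sq - log_a * sum_log_y - b * sum_log_x_log_y) / n
s_log_y = math.sqrt(s_squared)
print(f"S^2 = {s_squared:.7f}, S = {s_log_y:.5f}")
```

The shortcut works because, for a least squares fit, the sum of squared residuals equals the sum of squares of the dependent variable less the portions accounted for by each fitted constant.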

The standard error of estimate, in the form of a loga- 
rithm, is .04901. As long as we deal with logarithms, 
this is to be interpreted precisely as is the standard error 
with respect to other curves. Assuming a normal distribution 
of logarithms about the curve which describes the average re- 
lationship, the chances are 68 out of 100 that the logarithm 
of a given estimate will not differ from the logarithm of the 
actual value by more than .04901, 95 out of 100 that the 
logarithm of the given estimate will not differ from the 
logarithm of the actual value by more than .09802, and 
99.7 out of 100 that the logarithm of the given estimate 
will not differ from the logarithm of the actual value by 
more than .14703.

INTERPRETATION OF THE STANDARD ERROR OF ESTIMATE;
ZONES OF ESTIMATE

What does this mean in terms of actual values? It means,
simply, that we are dealing throughout in terms of ratios
instead of absolute figures. The difference between the
logarithms of two numbers is the logarithm of the ratio
of one of the original numbers to the other. Thus the
absolute value of S in a given case will depend upon the
magnitude of the values with which we are dealing. If
the user desires to reduce S to absolute values, it must be
done always with reference to a given estimate. Thus
a given value of X is substituted in the equation of
relationship and the corresponding value of Y is secured,
in the form of a logarithm. To the logarithm of the estimate
add the value of $S_{\log y}$. The anti-logarithm of the number
thus secured will give the upper limit of a zone extending
a distance equal to S above the line of regression. From
the logarithm of the estimate subtract the value of $S_{\log y}$.
The anti-logarithm of the number thus secured will give
the lower limit of a zone extending a distance equal to S
below the line of regression. The chances are 68 out of 100
that the actual value in the given case will fall within the
limits thus marked out. The absolute limits corresponding
to 2S and 3S may be similarly determined.

The zone of estimate about a logarithmic curve differs
from that about a curve fitted to natural numbers. In the
arithmetic case the zone is of constant width and is centered
always at the line of regression. The logarithmic zone,
when measured in absolute terms, is of varying width, and,
moreover, is not evenly divided on each side of the plotted
curve. It is in terms of ratios that the two sides of the
zone are equal. That is, the ratio of the upper limit to the
estimate is equal to the ratio of the estimate to the lower
limit. Plotted on logarithmic paper, the zone takes the
symmetrical form found in the arithmetic case. A person
accustomed to thinking in terms of ratios
and to the use of logarithmic paper can readily interpret
this measure.

THE STANDARD ERROR OF ESTIMATE IN TERMS OF RATIOS 

Since the ratios are equal throughout, the standard error
of estimate may be expressed in ratio terms. In the present
example we have

$$S_r = \text{anti-log } S_{\log y} = \text{anti-log } .04901 = 1.12$$

where $S_r$ is used to represent the standard error of estimate
in terms of ratios. ($S_{\log y}$, as derived above, is positive,
hence the ratio exceeds unity. It is the ratio of the larger
number to the smaller.) What does it mean? It means
that in 68 cases out of 100 the actual value, if it exceed
the estimate, will not exceed it by more than 12 per cent,
and, if it fall below the estimate, will stay within a
limit such that the estimate will not be more than 12 per
cent greater than the actual value. This is not a convenient
form, since this ratio always expresses the larger value
in terms of the smaller value. It would be more convenient
to have it always in terms of a percentage of the estimate.
This may be done by putting $S_{\log y}$ in negative terms,
and getting the corresponding natural value. The value
-.04901 = 9.95099 - 10, which is the logarithm of .8933.
In this form the ratio is based upon the relation of the
smaller to the larger number. To make $S_r$ readily intelligible
we may combine the two, writing

$$S_r = .89 \text{ to } 1.12.$$

Interpreting this, it means that, given a normal distribution, 
in 68 cases out of 100 the actual value will not be less than 
89 per cent of the estimate, or more than 112 per cent of 
the estimate. This has a simple, definite meaning more 
significant for most practical purposes than a similar 
measure in terms of absolute values. 
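
The passage from the logarithmic standard error to a zone in ratio terms may be sketched as below. The function name `ratio_zone` and the multiplier argument `k` are assumptions introduced for illustration. Any multiple of the standard error is taken on the logarithm before the anti-logarithm is found, since ratios compound rather than add.

```python
def ratio_zone(s_log_y: float, k: int = 1) -> tuple[float, float]:
    """Return (lower, upper) limits, as ratios to the estimate,
    for a zone of k logarithmic standard errors."""
    upper = 10 ** (k * s_log_y)    # anti-log of +k * S
    lower = 10 ** (-k * s_log_y)   # anti-log of -k * S
    return lower, upper

s_log_y = 0.04901
for k in (1, 2, 3):
    lower, upper = ratio_zone(s_log_y, k)
    print(f"{k}S: {lower:.2f} to {upper:.2f}")
```

The lower and upper limits are reciprocals of one another, which is the sense in which the zone is symmetrical in ratio terms.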

To find the values of $2S_r$ or $3S_r$ these percentages
may not be simply multiplied by 2 or 3. The value of
$S_{\log y}$ must be so multiplied, and the resulting values reduced
to natural numbers. For convenience in use the anti-logarithms
of both the positive and negative values should
be secured, as in the preceding case. The computations
are simple.

$2S_{\log y} = .09802.$

$3S_{\log y} = .14703.$

The corresponding anti-logarithms are 1.25 and .80 for the
first, and 1.40 and .71 for the second. Summarizing for the
standard error, we have

$S_r = .89$ to $1.12$
$2S_r = .80$ to $1.25$
$3S_r = .71$ to $1.40$.

APPLICATION OF THE STANDARD ERROR OF ESTIMATE

We may illustrate the use of $S_r$ by a specific example.
Given a production of oats 50 per cent above the trend
value (i.e., the ratio to trend is 1.50), what is the most
probable accompanying price ratio and what is the degree
of accuracy of this estimate?

¹ The significance of a measure of reliability in percentage form was pointed
out by D. H. Davenport in 1922, in an unpublished article. No published account
of such a measure has been available, however, and its possibilities have not,
therefore, been fully realized.
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The estimating equation is 

$$\log Y = (9.99139 - 10) - 1.18206 \log X.$$

Substituting in this equation the value .176091 (the logarithm
of 1.50) we secure for log Y the value 9.78324 - 10.
The corresponding natural number is .607. This means
that if production is 150 per cent of normal (as measured
by the given line of trend) price will probably be 60.7
per cent of normal (as measured by the line of trend).

To determine the reliability of this estimate, the standard
error must be secured. Employing the values of $S_r$ already
computed we find that 54 is 89 per cent of 60.7, while
68 is 112 per cent of 60.7. We interpret these figures to
mean that in 68 cases out of 100 the actual price prevailing
under the given production conditions will not be less than
54 per cent of the normal or trend value nor more than
68 per cent of normal.¹ Corresponding values for $2S_r$
and $3S_r$ may be determined in the manner outlined above.
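
The estimate and its zone can be reproduced in a few lines. The variable names are assumptions; the constants are those of the fitted equation and the standard error already computed.

```python
import math

log_a = -0.00861      # equivalent to 9.99139 - 10
b = -1.18206
s_log_y = 0.04901

production_ratio = 1.50   # production 50 per cent above trend

# Estimate in logarithmic form, then reduce to a natural number
log_estimate = log_a + b * math.log10(production_ratio)
price_ratio = 10 ** log_estimate

# One-standard-error zone: add and subtract S in logarithms
lower = 10 ** (log_estimate - s_log_y)
upper = 10 ** (log_estimate + s_log_y)

print(f"estimated price ratio = {price_ratio:.3f}")
print(f"zone of one standard error: {lower:.2f} to {upper:.2f}")
```

Adding and subtracting $S_{\log y}$ in logarithms gives the same limits as multiplying the estimate by the ratio form of the standard error.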

THE INDEX OF CORRELATION BASED ON LOGARITHMIC VALUES 

We have still to compute the third measure, the abstract
index of correlation.² For an equation of the type

$$\log Y = \log a + b \log X$$

the formula for $\rho$ reduces to

$$\rho^2_{\log y \log x} = \frac{\log a\,\Sigma(\log Y) + b\,\Sigma(\log X \cdot \log Y) - Nc^2_{\log y}}{\Sigma(\log^2 Y) - Nc^2_{\log y}}$$

where $c_{\log y}$ represents the difference between the arithmetic
mean of the logarithms of the Y-values and the origin
(in this case, zero on the logarithmic scale). Substituting

1 A question arises at once as to the adequacy of the given lines of trend,
in the present problem. This question is discussed in greater detail in another
section.

2 The symbol $\rho$ is used for this measure of correlation, instead of r, even
though the relationship in logarithmic form is linear. This is done because such
a measure, in terms of logarithms, cannot be interpreted in precisely the same
way as the ordinary coefficient of correlation.




the proper values, we have 

$$\rho^2_{\log y \log x} = \frac{(-.00861 \times -.32849) + (-1.18206 \times -.1143005) - (33 \times .00009909)}{.21721807 - (33 \times .00009909)}$$

$$= \frac{.13466882}{.2139481}$$

$$= .629445$$

$$\rho_{\log y \log x} = .793.$$

This index of correlation has a value of .793. How is
this to be interpreted when we are dealing with logarithms,
as in the present case?

The index may also be derived from the general form of the
relationship

$$\rho^2_{\log y \log x} = 1 - \frac{S^2_{\log y}}{\sigma^2_{\log y}}.$$

In the present case these values are

$S_{\log y} = .04901$
$\sigma_{\log y} = .08052$.

Substituting in the formula above, we have

$$\rho^2_{\log y \log x} = 1 - \frac{.002402}{.006483}$$

and

$$\rho_{\log y \log x} = .793.$$
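
This second route to the index can be sketched as follows; the variable names are assumptions, and the two measures of logarithmic variability are those just given.

```python
import math

s_log_y = 0.04901       # variability about the fitted curve
sigma_log_y = 0.08052   # variability about the mean of the logarithms

# Share of the logarithmic variance left unexplained by the curve
unexplained = s_log_y ** 2 / sigma_log_y ** 2

rho_squared = 1 - unexplained
rho = math.sqrt(rho_squared)
print(f"rho^2 = {rho_squared:.4f}, rho = {rho:.3f}")
```

The result agrees, within rounding, with the value obtained from the sums of the table, which serves as a check upon both computations.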

What does this value measure? We have seen that $\rho$
measures the degree to which a relationship is described by
a given function. The value of $\rho$ depends upon the relation
between the variability about the fitted curve and the
variability about the mean of the Y's. If the variability
of estimates is materially reduced when the equation of the
curve is used as a basis for estimates, instead of the mean,
the equation may be assumed to describe a significant
relationship. The value
of $\rho$ depends thus upon the relation between the two
quantities, $S_y$ and $\sigma_y$.

In the cases dealt with in the preceding chapter the 
variability in each case was measured in terms of absolute 
deviations, and the value of p depended upon the relation 
between the two given measures of absolute variability.
The sole difference in the present case is that we are working 
in terms of logarithmic or ratio variability, deviations being 
measured in terms of logarithms instead of natural numbers. 

The index $\rho$ must be interpreted in the light of this fact.
Its value, as always, depends upon the relation between
two measures of variability, S and $\sigma$, but in the present
instance these are expressed in terms of logarithms. In
brief, the value of $\rho$ depends upon the relation between
the ratio variability about the fitted curve and the ratio
variability about the geometric mean of the Y's. (It is
the geometric mean of the Y's, because that is the value
corresponding to the arithmetic mean of the Y logarithms.)

We have here a set of measures, therefore, which perform
in the field of ratios precisely the same service as is performed
in the field of natural numbers by S and $\rho$ (in the
linear case, r). These measures are secured in the same
way as are S and $\rho$, except that the equation of relationship
from which they are derived is one in which the dependent
variable is log Y (or, in the reverse case, log X). The general
formulas for computing these values are the same as in
dealing with natural numbers, except that log Y replaces
Y throughout. The operation is analogous to that of using
logarithmic paper instead of natural scale paper.

It should be noted that the values are in logarithmic or
ratio form if Y is expressed logarithmically, whether X
be so expressed or not. Thus we have fitted a curve of
the type

$$\log Y = \log a + b \log X$$

the logarithmic form of the ordinary parabola or hyperbola.
The values S and $\rho$ would also be in logarithmic form if
the curve were of the type

$$\log Y = \log a + X \log b$$

the logarithmic form of the exponential

$$Y = a(b^X).$$

In each of these cases the logarithmic equation is linear,
but this is not essential to the use of these measures. S and
$\rho$ are generally applicable measures, whether ratios or natural
numbers be dealt with, and whether the functions be
linear or otherwise.

It may be well at this point to summarize the symbols
that have been used and to distinguish the different measures.
We may employ the symbols $S_y$, $\sigma_y$, and $\rho$ when
arithmetic relations are in question, the two former being
measures of variation in absolute terms, and the index $\rho$
referring to degree of relationship when natural numbers
are employed. If the logarithms of the Y's are used it is
advisable to distinguish the symbols by subscripts, using
$S_{\log y}$ and $\sigma_{\log y}$ as measures of the logarithmic variation
about the fitted curve and about the arithmetic mean
of the logarithms of the Y's, respectively. If $S_{\log y}$ is reduced
to ratio form, it may be written $S_r$. Since the index $\rho$
must be interpreted somewhat differently in this case, it
may be written $\rho_{\log y \log x}$, or $\rho_{\log y\, x}$.

The Use of Reciprocals in the Measurement of
Relationship

Another type of curve may be used to describe the 
relationship between the production and price of oats, and 
its use introduces us to a third field of correlation, a field 
in which somewhat new concepts enter, and in which the 
various measures must be interpreted in still another way. 
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This is a curve of the type 

$$Y = \frac{1}{a + bX}$$

which may be expanded by adding additional terms to
the denominator, as

$$Y = \frac{1}{a + bX + cX^2}.$$

This hyperbolic form has been used in several studies as 
an approximation to a “demand” curve for various com- 
modities. 

The equation to a curve of this type may be written

$$\frac{1}{Y} = a + bX$$

which is the equation to a straight line describing the relationship
between the reciprocals of the Y's and the original
X values. The normal equations required in fitting a
curve of this type are

I    $\Sigma(1/Y) = Na + b\,\Sigma(X)$

II   $\Sigma(X/Y) = a\,\Sigma(X) + b\,\Sigma(X^2)$.

The method of computing the necessary values is illustrated 
in Table 134. 

Substituting the proper values in the normal equations, 
we have 

34.3360320 = 33a + 33.338b
35.2571485 = 33.338a + 34.1685545b.

Solving,

a = -.1357
b = 1.1643.
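
The solution of this pair of normal equations can be checked with the sketch below. The variable names and the helper `estimate_price_ratio` are assumptions introduced for illustration; the sums are those appearing in the normal equations above.

```python
# Sums entering the normal equations for the reciprocal curve
n = 33
sum_x = 33.338               # sum of production ratios X
sum_inv_y = 34.3360320       # sum of reciprocals 1/Y
sum_x_over_y = 35.2571485    # sum of X/Y
sum_x_sq = 34.1685545        # sum of X squared

# Normal equations:
#   sum_inv_y    = n * a     + b * sum_x
#   sum_x_over_y = a * sum_x + b * sum_x_sq
determinant = n * sum_x_sq - sum_x ** 2
a = (sum_inv_y * sum_x_sq - sum_x * sum_x_over_y) / determinant
b = (n * sum_x_over_y - sum_x * sum_inv_y) / determinant

def estimate_price_ratio(x: float) -> float:
    """Price ratio Y estimated from production ratio X via 1/Y = a + bX."""
    return 1.0 / (a + b * x)

print(f"a = {a:.4f}, b = {b:.4f}")
```

Since the fitting is done on the reciprocals of the price ratios, least squares is satisfied with respect to 1/Y and not with respect to Y itself.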
Table 134

Computation of Values Required in Fitting a Curve to Data of Oat
Production and Prices

Example II

(1)     (2)      (3)         (4)          (5)          (6)           (7)
Year    Price    Production  1/Y          X/Y          (1/Y)^2       X^2
        ratio Y  ratio X

1881    1.30      .929        .7692308     .7146154     .59171602     .863041
1882    1.05     1.036        .9523810     .9866667     .90702967    1.073296
1883     .90     1.156       1.1111111    1.2844444    1.23456788    1.336336
1884     .85     1.128       1.1764706    1.3270588    1.38408307    1.272384
1885     .84     1.165       1.1904762    1.3869048    1.41723358    1.357225
1886     .77     1.108       1.2987013    1.4389610    1.68662507    1.227664
1887     .96     1.124       1.0416667    1.1708334    1.08506951    1.263376
1888     .79     1.151       1.2658228    1.4569620    1.60230736    1.324801
1889     .81     1.188       1.2345679    1.4666667    1.52415790    1.411344
1890    1.48      .798        .6756757     .5391892     .45653765     .636804
1891    1.10     1.088        .9090909     .9890909     .82644626    1.183744
1892    1.09      .943        .9174312     .8651376     .84168001     .889249
1893    1.16      .882        .8620690     .7603449     .74316296     .777924
1894    1.07      .886        .9345794     .8280373     .87343865     .784996
1895     .75     1.070       1.3333333    1.4266666    1.77777769    1.144900
1896     .76      .983       1.3157895    1.2934211    1.73130201     .966289
1897     .96      .969       1.0416667    1.0093750    1.08506951     .938961
1898     .95     1.005       1.0526316    1.0578948    1.10803329    1.010025
1899     .83     1.074       1.2048193    1.2939759    1.45158955    1.153476
1900     .86     1.033       1.1627907    1.2011628    1.35208221    1.067089
1901    1.37      .857        .7299270     .6255480     .53279343     .734449
1902    1.03     1.131        .9708738    1.0980583     .94259594    1.279161
1903    1.14      .911        .8771930     .7991228     .76946756     .829921
1904     .86     1.033       1.1627907    1.2011628    1.35208221    1.067089
1905     .86     1.090       1.1627907    1.2674419    1.35208221    1.188100
1906    1.04     1.013        .9615385     .9740385     .92455629    1.026169
1907    1.31      .770        .7633588     .5877863     .58271666     .592900
1908    1.29      .796        .7751938     .6170543     .60092543     .633616
1909    1.03      .978        .9708738     .9495146     .94259594     .956484
1910     .81     1.064       1.2345679    1.3135802    1.52415790    1.132096
1911    1.14      .810        .8771930     .7105263     .76946756     .656100
1912     .80     1.221       1.2500000    1.5262500    1.56250000    1.490841
1913     .87      .948       1.1494253    1.0896552    1.32117852     .898704

Total           33.338      34.3360320   35.2571485   36.85702950   34.168554

The equation is, therefore,

$$\frac{1}{Y} = -.1357 + 1.1643X.$$
THE STANDARD ERROR AND THE INDEX OF CORRELATION IN
TERMS OF RECIPROCALS

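To determine the utility of this equation we must have
the standard error and the index of correlation. The two
necessary formulas may be derived as in the preceding
cases. Representing by $\frac{1}{Y}$ the reciprocal of an actual value
we have, for each residual,

$$d = a + bX - \frac{1}{Y} \qquad (1)$$

Multiplying by $d$ and summing

$$\Sigma(d^2) = a\,\Sigma(d) + b\,\Sigma(dX) - \Sigma\left(\frac{d}{Y}\right)$$

But since

$$\Sigma(d) = 0 \quad \text{and} \quad \Sigma(dX) = 0,$$

we have

$$\Sigma(d^2) = -\Sigma\left(\frac{d}{Y}\right) \qquad (2)$$

Multiplying the residual equation (1) now by $\frac{1}{Y}$ and sum-
ming, we have

$$\Sigma\left(\frac{d}{Y}\right) = a\,\Sigma\left(\frac{1}{Y}\right) + b\,\Sigma\left(\frac{X}{Y}\right) - \Sigma\left(\frac{1}{Y^2}\right)$$

Substituting the equivalent of $\Sigma\left(\frac{d}{Y}\right)$ in the preceding equa-
tion (2), we secure

$$\Sigma(d^2) = \Sigma\left(\frac{1}{Y^2}\right) - a\,\Sigma\left(\frac{1}{Y}\right) - b\,\Sigma\left(\frac{X}{Y}\right)$$

and for $S^2_{1/y}$ we have

$$S^2_{1/y} = \frac{\Sigma\left(\frac{1}{Y^2}\right) - a\,\Sigma\left(\frac{1}{Y}\right) - b\,\Sigma\left(\frac{X}{Y}\right)}{N}$$

Inserting this value of $S^2_{1/y}$ in the general formula for the
index of correlation

$$\rho^2_{1/y} = 1 - \frac{S^2_{1/y}}{\sigma^2_{1/y}}$$

and simplifying, we have

$$\rho^2_{1/y} = \frac{a\,\Sigma\left(\frac{1}{Y}\right) + b\,\Sigma\left(\frac{X}{Y}\right) - \frac{1}{N}\left[\Sigma\left(\frac{1}{Y}\right)\right]^2}{\Sigma\left(\frac{1}{Y^2}\right) - \frac{1}{N}\left[\Sigma\left(\frac{1}{Y}\right)\right]^2}$$

Inserting the proper values in these two equations, we find
that

$$S_{1/y} = .1191$$
$$\rho_{1/y} = .766.$$

For the standard deviation of the original $Y$-values, in
terms of reciprocals, we secure

$$\sigma_{1/y} = .1851.$$

(The subscript $\frac{1}{y}$ is used in connection with each of these
measures, as they should be distinguished from measures
based upon natural numbers or logarithms.)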
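
The three measures can be recomputed from the column totals of Table 134. The sketch below is illustrative only: the variable names are assumptions, and the total of the squared reciprocals is taken as the sum of column (6). Note that a and b are recomputed at full precision rather than rounded to four places, because the residual sum of squares is a small difference between large totals and is sensitive to rounding.

```python
import math

# Column totals from Table 134 (N = 33 years).
N = 33
SUM_X = 33.338
SUM_INV_Y = 34.3360320
SUM_X_OVER_Y = 35.2571485
SUM_INV_Y2 = 36.85702950   # sum of column (6), the squared reciprocals
SUM_X2 = 34.168554

# Constants of 1/Y = a + bX, kept at full precision.
b = (N * SUM_X_OVER_Y - SUM_X * SUM_INV_Y) / (N * SUM_X2 - SUM_X ** 2)
a = (SUM_INV_Y - b * SUM_X) / N

# Standard error: S^2 = [sum(1/Y^2) - a*sum(1/Y) - b*sum(X/Y)] / N.
s = math.sqrt((SUM_INV_Y2 - a * SUM_INV_Y - b * SUM_X_OVER_Y) / N)

# Standard deviation of the reciprocals about their arithmetic mean.
sigma = math.sqrt(SUM_INV_Y2 / N - (SUM_INV_Y / N) ** 2)

# Index of correlation from rho^2 = 1 - S^2 / sigma^2.
rho = math.sqrt(1.0 - s ** 2 / sigma ** 2)

print(round(s, 4), round(sigma, 4), round(rho, 3))
```

All three results are in reciprocal units (or, for the index, based on reciprocal units) and are not directly comparable with measures computed from natural numbers or logarithms.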

INTERPRETATION OF THE STANDARD ERROR OF ESTIMATE

How may we interpret these results? As in all former 
problems of this type the equation gives us a means of 
estimating Y from a known value of X. The standard 
error $S_{1/y}$ serves as a measure of the reliability of such
estimates, and $\rho_{1/y}$ is an abstract measure of the degree of
relationship between the two variables. But in the present
case all these measures are in terms of reciprocals. The
equation enables us to estimate the reciprocal of $Y$, the
standard error has significance only in the form of a recipro-
cal, and the value of $\rho$ depends upon the relation between
two measures ($S_{1/y}$ and $\sigma_{1/y}$) both of which are in terms of
reciprocals.

An illustration may make these meanings clear. If, in 
a given year, the production of oats is 150 per cent of 
trend, what is the most probable price? Substituting in 
the equation

$$\frac{1}{Y} = -.1357 + 1.1643X$$

a value of 1.50 for $X$, we have

$$\frac{1}{Y} = 1.6108$$

and

$$Y = .621.$$


We may expect a price approximately 62 per cent of trend. 
As a measure of the reliability of this estimate, we have 

$$S_{1/y} = .1191.$$

This must be applied to the estimate in terms of reciprocals.
Thus we have

$$1.6108 + .1191 = 1.7299$$
$$1.6108 - .1191 = 1.4917.$$


Reducing these reciprocals to natural numbers we secure 
.578 and .670 as the desired values. The most probable 
price, then, is 62.1 per cent of trend, and, on the assump- 
tion of an approximately normal distribution of reciprocals 
about the curve, the odds are 68 out of 100 that the price 
will fall between 57.8 per cent of trend and 67.0 per cent 
of trend. The limits of $2S$ and $3S$ may be similarly deter-
mined by adding to and subtracting from the estimate,
as a reciprocal, amounts equal to twice .1191 and three
times .1191. The results secured may then be converted
to natural numbers. Just as with logarithms, the value
in absolute terms of a given difference between reciprocals
varies at different points within the range of $Y$-values.
Accordingly, the limits of reliability determined from $S_{1/y}$
should be expressed in natural numbers only after a particu-
lar estimate has been made.
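
The procedure just described, forming the limits on the reciprocal scale and converting to natural numbers only at the end, can be sketched as follows. This is an illustrative Python fragment, not part of the original computation; the function name `reciprocal_estimate` and the multiplier `k` are assumptions introduced here.

```python
def reciprocal_estimate(x, a, b, s, k=1):
    """Estimate Y from 1/Y = a + bX, with limits at +/- k standard errors.

    The standard error s is in reciprocal units, so the limits are formed
    on the reciprocal scale and only then converted to natural numbers.
    """
    inv_y = a + b * x
    if inv_y - k * s <= 0:
        raise ValueError("upper limit is undefined: reciprocal is not positive")
    estimate = 1.0 / inv_y
    lower = 1.0 / (inv_y + k * s)   # larger reciprocal gives the smaller Y
    upper = 1.0 / (inv_y - k * s)   # smaller reciprocal gives the larger Y
    return lower, estimate, upper


# Production at 150 per cent of trend, constants as fitted above.
lower, estimate, upper = reciprocal_estimate(1.50, a=-0.1357, b=1.1643, s=0.1191)
print(round(lower, 3), round(estimate, 3), round(upper, 3))
```

Calling the function with `k=2` or `k=3` gives the wider limits; the interval is not symmetrical about the estimate in natural numbers, which is why it must be recomputed for each particular estimate.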

A Comparison of Measures of Relationship

In interpreting $\rho$ similar considerations enter. The value
of the index of correlation, as we have seen, depends upon 
the degree of variation about the curve, as compared with 
the variation about the average of the original dependent 
series. In handling natural numbers, variability about the 
fitted line is compared with the variability about the 
arithmetic mean of the dependent variable, both measured
in absolute terms (i.e., $S_y$ is compared with $\sigma_y$). In handling
logarithms, variability about the fitted line is compared 
with variability about the arithmetic mean of the loga- 
rithms of the dependent series, variability being measured 
in each case in terms of logarithms. But logarithmic 
deviations, as we have seen, may be interpreted in terms 
of ratios. The logarithmic deviations from the line represent 
the ratios of actual values to computed, while logarithmic 
deviations about the arithmetic mean of the logarithms of 
the original series represent the ratios of the actual values 
of the dependent series to their geometric mean. The value
of $\rho_{\log y}$ depends upon the relation between these respective
deviations (i.e., $S_{\log y}$ is compared with $\sigma_{\log y}$).

In fitting a curve in which the reciprocals of the dependent 
variable are employed, variability about the fitted line is 
measured in terms of reciprocals, and the variability of
the original series is measured in the same terms. That
is, $\sigma_{1/y}$ is computed from the differences between the recipro-
cals of the actual values and the arithmetic mean of all
these reciprocals. But the arithmetic mean of these recipro-
cals is the reciprocal of the harmonic mean. Thus, in
short, the value of the index of correlation, $\rho_{1/y}$, depends upon
the relation between variability about the fitted line and
variability about the harmonic mean of the dependent series,
variation in both cases being measured in terms of reciprocals
(i.e., $S_{1/y}$ is compared with $\sigma_{1/y}$).

We have, therefore, three broad families of curves for 
describing the relationship between variable quantities. 
These are: 

1. Curves in the fitting of which natural values of the dependent
variable are employed. Equations to all curves of this family
will be of the type

$$Y = f(X).$$

2. Curves in the fitting of which logarithms of the dependent
variable are employed. In all such cases the equations will be
of the type

$$\log Y = f(X).$$

3. Curves in the fitting of which reciprocals of the dependent
variable are employed. For these curves the equations will
be of the type

$$\frac{1}{Y} = f(X).$$

In any one of these three cases the equations may be 
linear or non-linear. In so far as this problem of interpreta- 
tion is concerned, there is no limitation as to the function 
of $X$ which may be employed. (The computation of $S$
and $\rho$ by the methods suggested above involves certain
limitations, which are outlined elsewhere.) 

The standard error of estimate for the first family of 
curves is derived in terms of the original units of measure- 
ment (for the dependent variable) and has a direct and 
simple meaning in these terms. The index of correlation, 
for curves of this type, is a measure of the degree to which 
the absolute variability of the dependent variable may be
lessened by measuring deviations from the fitted curve
instead of from the arithmetic mean. 

The standard error of estimate for the second family of 
curves is derived, by the method outlined, in terms of 
logarithms. It is more convenient in general to give it mean-
ing in terms of ratios. The index of correlation, $\rho_{\log y}$,
is a measure of the degree to which the “logarithmic” or
ratio variability of the dependent variable may be lessened
by computing deviations (or ratios) with the fitted curve
instead of the mean as base.

The standard error of estimate for the third family of 
curves is derived by the same process as in the other cases, 
but emerges as a reciprocal. The index of correlation, $\rho_{1/y}$,
is a measure of the degree to which the variability of the
dependent variable, in terms of reciprocals, may be lessened 
by computing reciprocal deviations from the fitted curve 
instead of from the harmonic mean. 

FACTORS GOVERNING THE CHOICE OF MEASURES OF 
RELATIONSHIP 

It is clear, therefore, that the choice of a type of curve 
to describe a given relationship must be governed by basic 
considerations as to the type of average which is most 
appropriate as a measure of the central tendency of the 
given series. And this brings in a related question as to 
whether the dispersion about this average more nearly 
approximates the normal type when measured in absolute 
terms, in logarithms, or in reciprocals. In selecting a 
curve and in using the measures $S$ and $\rho$ there is always
present an implicit assumption with respect to these points. 

When absolute values are important, and the dispersion 
of the dependent variable approaches the normal type when 
plotted on an arithmetic scale, measures of relationship of 
the arithmetic type would appear to be appropriate. But, 
as we have seen, in handling series in which rates of change
rather than absolute amounts of change are of primary
importance and the dispersion appears to follow a geometric 
law, the arithmetic mean and other arithmetic measures 
are notoriously inadequate. In such cases logarithmic curves 
seem preferable to arithmetic, and measures of the reliability 
of estimates and of degree of relationship which are based 
upon ratios seem to be more suitable than those based upon 
absolute values. 

The harmonic mean has not been so widely employed as 
either of the above averages, and some attention may be 
given to principles governing its use in problems of the 
type here considered. In general, such harmonic measures 
are marked by the same weaknesses as the arithmetic, 
except that they err in the opposite direction. Geometric 
measures are perhaps better adapted to all-around employ- 
ment than either. Yet in one particular field of interest 
to the economist the harmonic mean is particularly appro- 
priate, and the utilization of reciprocals, as in the preceding 
example, seems to be justified. 

The use of the harmonic mean assumes a normal distribu- 
tion of reciprocals which, in natural numbers, means a 
much wider scatter above the average than below. The 
use of a curve of the type 

$$\frac{1}{Y} = a + bX$$

involves a similar assumption as to the relation between 
Y and X. A given absolute increase in X will be accom- 
panied by a certain decrease in the value of Y. The same 
absolute decrease in X will be accompanied by an increase 
in the value of Y which is larger than the decrease registered 
in the preceding case. But this is the relation which prevails, 
for many commodities, between the amounts produced and 
the price, the latter considered dependent. A given increase 
in production will cause some lowering of price. An equal 
decrease will cause a much greater increase in price. 
Moreover, when averaging the prices of such commodities
over a period, the harmonic mean may give a more typical
value than any other average.* In such cases there is a 
strong a priori justification for using a curve of the reciprocal 
type and measuring the accuracy of all estimates in terms 
of harmonic relations. 
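
The asymmetry described above can be seen directly from the fitted oats equation. The following Python sketch is illustrative only; the function name `price_from_production` is an assumption, and the constants are those of the reciprocal equation fitted earlier.

```python
def price_from_production(x, a=-0.1357, b=1.1643):
    """Price ratio Y implied by the reciprocal curve 1/Y = a + bX."""
    return 1.0 / (a + b * x)


base = price_from_production(1.0)
fall = base - price_from_production(1.2)   # price drop when X rises by 0.2
rise = price_from_production(0.8) - base   # price gain when X falls by 0.2
print(round(fall, 3), round(rise, 3))
```

For equal absolute changes in production on either side of normal, the computed rise in price exceeds the computed fall, which is the behavior the reciprocal curve is meant to capture.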

ARITHMETIC, GEOMETRIC, AND HARMONIC MEASURES

The contrast between these different methods may be 
brought home most effectively by comparing the results 
obtained when curves of these three types are fitted to the 
same data. The computations involved in fitting curves 
of the second and third types (logarithmic and reciprocal) 
have been illustrated with reference to the data of oat 
production and prices (Table 132). A straight line (arith-
metic) is fitted to the same data, and the necessary accom-
panying measures computed. The three sets of results are
brought together in Table 135.

* “Buyers and sellers of potatoes are frequently mistaken as to the price
justified by fundamental economic conditions. If such an error is general in the
fall, it may happen, for example, that the price which results is too high. If the
price is too high in the early part of the season, potatoes will not be consumed
fast enough to dispose of the supply available. Farmers and dealers will then
find that not all of the stocks on hand can be sold at existing prices. Since
potatoes can not be carried over from one year to the next, the price, under
such conditions as have been mentioned, must be lowered enough to permit
the supply to be disposed of before the end of the season. A properly adjusted
price would remain the same throughout the season, except for a gradual ad-
vance to cover cost of storage, and would maintain a fairly uniform consump-
tion throughout the season. But since an abnormally high price early in the
season causes small consumption, it must be compensated by an abnormally
low price during the remainder of the season, or not all the crop can be sold.

“Similarly, if the price is abnormally low early in the season, the supply will
be exhausted too rapidly and those who still have potatoes will find that they
can get abnormally high prices for them during the remainder of the season.”

But how, given the abnormally high or abnormally low prices during part of
a season, may we compute the average price which would be justified by the
true conditions of demand and supply, if these had been correctly estimated?
Since “a low price during part of a season will be compensated only by a dis-
proportionately high price during the remainder of the season” the arithmetic
average for an entire season “will be somewhat higher than the average which
would have resulted had a proper price been established at the beginning of the
season. This difficulty is eliminated by taking the harmonic mean of the monthly
[prices].”

Holbrook Working, Factors Determining the Price of Potatoes in St. Paul and
Minneapolis, Technical Bulletin 10, University of Minnesota Agricultural
Experiment Station, 8-10.

Table 135

Relation between the Production and Price of Oats, 1881-1913
Comparison of Results of Curve Fitting
(Prices are the dependent variable in each case)

     Equation                              Standard error         Index of
                                           of estimate            correlation

A    Y = 2.24 - 1.236X                     $S_y$ = .12              $r$ = -.783
B    1/Y = -.1357 + 1.1643X                $S_{1/y}$ = .1191        $\rho_{1/y}$ = .766
C    log Y = -.00861 - 1.18206 log X       $S_{\log y}$ = .04901

It is impossible to compare the three standard errors as
they stand, since only the first one is in the original units
of measurement (ratio of actual price to trend). In the 
following table are given estimates, based on each of these 
equations, as to the most probable price (in terms of ratio 
to trend) which would accompany each of five different 
conditions of production.^ Each estimate is accompanied 
by a series of values which indicate the limits set by the 
standard error. Throughout, the values of the estimates 
plus and minus S, 2S, and 3S are given, in order to indicate 
the probable scatter of actual values about the estimates. 
The different amounts of variation which may be expected 
about each of the three lines of relationship are measured 
by the actual differences between the estimates and the 
limiting cases. These differences are given in the columns 
headed Δ. All values in this table are comparable, being
reduced to the original units (ratio of actual price to trend).

^ For the purpose of this illustration the limits of actual observation have
been exceeded in setting up Table 136. Such extrapolation involves the pos-
sibility of errors of another sort. With these we are not here concerned.

Table 136

Comparison of Price Estimates and of Standard Errors of Estimate
Based on Three Equations Relating to the Production and
Price of Oats

Column headings:
(1)  Value of X (ratio of production to normal)
(2)  Estimated value of Y (ratio of price to trend) from arithmetic equation (A)
(3)  Limits of arithmetic estimate
(4)  Δ
(5)  Estimated value of Y from reciprocal equation (B)
(6)  Limits of estimate, reciprocal
(7)  Δ
(8)  Estimated value of Y from logarithmic equation (C)
(9)  Limits of logarithmic estimate
(10) Δ

(1)    (2)      (3)            (4)      (5)      (6)             (7)       (8)      (9)            (10)

 .5    1.622    +3S = 1.982    + .36    2.240    +3S = 11.223    +8.983    2.224    +3S = 3.114    + .890
                +2S = 1.862    + .24             +2S =  4.803    +2.563             +2S = 2.780    + .556
                 +S = 1.742    + .12              +S =  3.055    + .815              +S = 2.491    + .267
                 -S = 1.502    - .12              -S =  1.768    - .472              -S = 1.979    - .245
                -2S = 1.382    - .24             -2S =  1.461    - .779             -2S = 1.779    - .445
                -3S = 1.262    - .36             -3S =  1.244    - .996             -3S = 1.579    - .645

 .8    1.251    +3S = 1.611    + .36    1.257    +3S =  2.281    +1.024    1.276    +3S = 1.786    + .510
                +2S = 1.491    + .24             +2S =  1.794    + .537             +2S = 1.595    + .319
                 +S = 1.371    + .12              +S =  1.478    + .221              +S = 1.429    + .153
                 -S = 1.131    - .12              -S =  1.093    - .164              -S = 1.136    - .140
                -2S = 1.011    - .24             -2S =   .967    - .290             -2S = 1.021    - .255
                -3S =  .891    - .36             -3S =   .867    - .390             -3S =  .906    - .370

1.0    1.004    +3S = 1.364    + .36     .972    +3S =  1.490    + .518     .980    +3S = 1.372    + .392
                +2S = 1.244    + .24             +2S =  1.265    + .293             +2S = 1.226    + .245
                 +S = 1.124    + .12              +S =  1.100    + .128              +S = 1.098    + .118
                 -S =  .884    - .12              -S =   .871    - .101              -S =  .872    - .108
                -2S =  .764    - .24             -2S =   .789    - .183             -2S =  .784    - .196
                -3S =  .644    - .36             -3S =   .722    - .250             -3S =  .696    - .284

1.2     .757    +3S = 1.117    + .36     .793    +3S =  1.106    + .313     .790    +3S = 1.106    + .316
                +2S =  .997    + .24             +2S =   .977    + .184             +2S =  .987    + .197
                 +S =  .877    + .12              +S =   .875    + .082              +S =  .885    + .095
                 -S =  .637    - .12              -S =   .724    - .069              -S =  .703    - .087
                -2S =  .517    - .24             -2S =   .667    - .126             -2S =  .632    - .158
                -3S =  .397    - .36             -3S =   .618    - .175             -3S =  .561    - .229

1.5     .386    +3S =  .746    + .36     .621    +3S =   .798    + .177     .607    +3S =  .852    + .245
                +2S =  .626    + .24             +2S =   .728    + .107             +2S =  .761    + .154
                 +S =  .506    + .12              +S =   .670    + .049              +S =  .680    + .073
                 -S =  .266    - .12              -S =   .578    - .043              -S =  .542    - .065
                -2S =  .146    - .24             -2S =   .541    - .080             -2S =  .484    - .123
                -3S =  .026    - .36             -3S =   .508    - .113             -3S =  .433    - .174

Zones of Estimate and Their Significance 

A careful study of this table should make clear the nature 
of estimates based on the three types of equations here 
presented. The fundamental differences lie not so much
in the actual values of the estimates, as in the standard
errors which measure the reliability of these estimates and
indicate the limits within which the actual values are likely
to fall. In other words, the differences lie in the assumptions
made as to the character of the scatter about the curves.

Fig. 87. The Relation between the Production and Price of Oats:
Illustrating the Use of an Arithmetic Equation of Regression and Arith-
metic Zones of Estimate

The measure $S_y$, which relates to the arithmetic curve,
gives the same absolute range to errors of estimate whether
the estimated value be high or low. An arithmetic dispersion
about the fitted line is assumed. In Fig. 87 the original points
are plotted, the straight line of relationship (arithmetic) is
shown, and zones of estimate having widths, respectively, of
$2S$, $4S$, and $6S$, centering at the fitted line, are marked out.

Fig. 89. The Relation between the Production and Price of Oats:
Illustrating the Use of a Logarithmic Equation of Regression and Geo-
metric Zones of Estimate (Plotted on Double Logarithmic Paper)

The measure $S_{\log y}$ gives the same relative or percentage
range to errors of estimate, whether the estimate be high
or low. This means that the absolute range within which
the actual values should fall is much less when the estimates
are low than when they are high. It assumes a geometric
dispersion about the curve which describes the relationship.
The estimate is, in this case, the geometric mean of the
value which exceeds it by an amount equal to $S_{\log y}$ (or
any multiple of $S_{\log y}$) and the value which falls below it
by an equal amount. Fig. 88 presents these relationships
graphically. The original data are here plotted, together
with the graph of the equation

$$Y = .9804X^{-1.18206}.$$

There are shown, also, the limits of zones of estimate having
widths equal, respectively, to $2S_r$, $4S_r$, and $6S_r$, centering
(geometrically) at the line of relationship. A comparison
of Fig. 87 and Fig. 88 will reveal the differences between 
estimates based on the assumption of an arithmetic distribu- 
tion and those based on the assumption of a geometric 
distribution. 

The points and lines shown in Fig. 88 are plotted on a 
logarithmic scale in Fig. 89. On this scale the curve of 
relationship becomes straight, and the zones of estimate 
appear as symmetrical and of equal width throughout the 
range. This transformation when the data are plotted on 
logarithmic paper makes clear the fundamental simplicity 
of the assumptions involved in making estimates from 
logarithmic values. 

In using the measure $S_{1/y}$ we carry still further the assump-
tion that the variability about the curve is greater with
high prices than with low. It shows a very limited range
to errors of estimate when the estimate is low and a very
wide range when the estimated price is high. A harmonic
dispersion about the curve is assumed. The computed
value, or estimate, is always the harmonic mean of the
value which exceeds it by an amount equal to $S_{1/y}$ (or any
multiple of $S_{1/y}$) and the value which falls below it by an
equal amount.
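
The statement that the estimate is the arithmetic, geometric, or harmonic mean of its two limits, according to the scale on which the standard error is applied, can be verified with a short Python sketch. The numbers below are illustrative: a single estimate of .621 is used for all three cases simply to show the property, and the variable names are assumptions.

```python
import math
from statistics import geometric_mean, harmonic_mean, mean

estimate, k = 0.621, 1

# Arithmetic zone: add and subtract S in natural numbers.
s_y = 0.12
arith = (estimate - k * s_y, estimate + k * s_y)

# Geometric zone: add and subtract S on the log10 scale.
s_log = 0.04901
geo = (
    10 ** (math.log10(estimate) - k * s_log),
    10 ** (math.log10(estimate) + k * s_log),
)

# Harmonic zone: add and subtract S on the reciprocal scale.
s_recip = 0.1191
harm = (1 / (1 / estimate + k * s_recip), 1 / (1 / estimate - k * s_recip))

# Each mean, taken over the two limits of its own zone, returns the estimate.
print(mean(arith), geometric_mean(geo), harmonic_mean(harm))
```

In each case the mean appropriate to the scale recovers the estimate (up to floating-point rounding), while the limits themselves are unequally spaced about it in natural numbers for the geometric and harmonic zones.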
Fig. 90. The Relation between the Production and Price of Oats:
Illustrating the Use of an Equation of Regression Based upon Reciprocals,
and of Harmonic Zones of Estimate

In Fig. 90 the curve $\frac{1}{Y} = -.1357 + 1.1643X$ is plotted,
together with the original observations. Zones of estimate
with widths of $2S_{1/y}$, $4S_{1/y}$, and $6S_{1/y}$, centering (harmonically)
at the fitted line, are shown. The differences between this
figure and each of the two preceding are quite marked,
particularly with respect to the zones of estimate. On the
assumption of a normal harmonic distribution about the
curve describing the relationship, the outer zone (with width
equal to $6S$) marks the limits within which 99.7 per cent
of all the points should fall, and the inner zone (with width
equal to $2S$) marks the limits within which 68 per cent
of all the points should fall. By plotting reciprocals through-
out, instead of natural numbers, this apparently abnormal
distribution could be reduced to the symmetrical form
secured in plotting the geometric values on the logarithmic
chart.

For both high and low estimates the geometric measure,
$S_{\log y}$, stands between the arithmetic measure, $S_y$, and the
harmonic measure, $S_{1/y}$. While the two latter have their
particular functions, and are appropriate in certain cases,
it is probably true that in using such methods as these in
economic analysis, measures of the geometric family are
more generally useful than those of the other types. This
means, merely, that ratios are usually more important
than absolute differences. It seems reasonable therefore
to base estimates upon an equation of the type

$$\log Y = f(X)$$

and to measure the reliability of these estimates in terms
of logarithms or ratios, using $S_{\log y}$ or $S_r$. In such cases, as
we have seen, correlation is measured by $\rho_{\log y\,\log x}$ or $\rho_{\log y\,x}$.
The value of this index depends upon the ratio variability
about the curve, as compared with the ratio variability
about the geometric mean.^

^ The reasoning in C. M. Walsh’s book, The Problem of Estimation (London,
King, 1921, p. 12) is peculiarly applicable to the present problem. Citing Gali-
leo, in defence of the use of the geometric mean in averaging estimates, Walsh
writes: “And so errors must be measured by an error which is a ratio between
the estimate and the true quantity, and not a concrete quantity itself. We cannot
measure errors by so many pounds, feet or crowns; we must measure them
by the proportions of the pounds, feet or crowns in the erroneous estimates to
the pounds, feet or crowns in the thing estimated.” (Italics mine.) This ar-

STATISTICAL INDUCTION AND THE PROBLEM
OF SAMPLING, CONCLUDED

The methods of induction discussed in an earlier section 
(Chapter XIV) dealt with the more familiar procedures 
employed in generalizing results secured from the study 
of samples. Certain research problems call for modifications 
of the methods there described, while for some purposes 
quite different instruments are needed. In the present 
chapter, therefore, we carry forward the discussion of 
statistical inference, considering methods appropriate to 
certain special conditions and special problems. 

Generalizing from Small Samples 

The standard error of an arithmetic mean, we have seen,
is given by

$$\sigma_M = \frac{\sigma}{\sqrt{N}}$$

where $N$ is the number of observations in the sample and
$\sigma$ is the standard deviation of the population from which
the sample is drawn. We do not know the standard deviation
of the population but we approximate it from the standard
deviation of the sample. (For convenience in this exposition
we shall use $s$ as a symbol for the standard deviation of
the sample; $\sigma$ will denote the standard deviation of the
population.) This is an acceptable approximation when
$N$ is reasonably large, say 30 or more. But for small values
of $N$ the standard deviation of the sample is subject to
a definite bias, tending to make it consistently lower than
the standard deviation of the population. The value of
$\sigma_M$ derived by the customary method is also biased down-
ward. Therefore, when methods appropriate to large
samples are employed with small samples, we consistently 
under-estimate the sampling errors to which our measure- 
ments are subject. This bias shows remarkable consistency, 
however. With samples of any stated size the magnitude 
of the error to be expected from the use of the standard devia- 
tion of the sample as an approximation to the standard devia- 
tion of the population may be determined, and correction 
made for it. Accordingly, generalization of results secured 
from small samples is possible. In the nature of things the 
margin of error in such generalization is larger than it is when 
large samples are used, but the distortion due to sheer 
bias may be avoided.^ 

The nature of the error involved in generalizing from 
small samples may be brought out in the following terms. 
If we represent by $M$ the mean of the population from
which a sample is drawn, by $\bar{X}$ the mean of a single sample,
and by $\sigma_{\bar{x}}$ the standard deviation of a distribution of a
number of $\bar{X}$'s computed from successive samples, we may
write

$$T = \frac{\bar{X} - M}{\sigma_{\bar{x}}}$$

The quantity $T$ is the deviation of the mean of the sample
from the mean of the population, expressed in units of the
standard deviation of the sample means. When $\sigma_{\bar{x}}$ is
determined from the actual distribution of a number of
$\bar{X}$'s, or from the true standard deviation of the population
and $N$ of the sample, the quantity $T$ may be interpreted
as a normal deviate. The significance of given values of
$T$ may then be determined with reference to a table of
areas under the normal curve. Actually, we do not
have a large number of $\bar{X}$'s, which may be arranged in a
frequency distribution, nor do we know the value of $\sigma$
(the standard deviation of the population), nor of $\sigma_{\bar{x}}$ (the
standard error of $\bar{X}$). We approximate $\sigma$ by $s$ (the standard
deviation of the sample) and $\sigma_{\bar{x}}$ by what we may call $s_{\bar{x}}$
($s_{\bar{x}} = \frac{s}{\sqrt{N}}$ if $s$ has been computed from $\sqrt{\frac{\Sigma (X - \bar{X})^2}{N - 1}}$;
$s_{\bar{x}} = \frac{s}{\sqrt{N - 1}}$ if $s$ has been computed from $\sqrt{\frac{\Sigma (X - \bar{X})^2}{N}}$).
When these approximations are based upon small samples, the $T$ derived
from them may not be interpreted as a normal deviate. For
the distribution of $T$ varies with the size of the sample. With
small samples the distribution departs significantly from the
normal type. Statistical inferences that fail to take account
of this are inaccurate.

^ The bias involved in the use of $s$ as an approximation to $\sigma$, for small sam-
ples, was first discovered by “Student.” For the original memoir see “The
Probable Error of the Mean,” Biometrika, Vol. 6, 1908, 1-25.

A discussion in detail of the distributions of statistical 
measurements obtained from small samples would carry 
us beyond the scope of the present book. We may briefly 
note, however, certain characteristics of the distribution 
function of the standard deviation. These are effectively 
revealed by the results of an interesting experiment con- 
ducted by W. A. Shewhart. 

Shewhart drew 1,000 samples, each consisting of four
observations, from a normally distributed parent population
with a known standard deviation, equal to unity.¹ The
standard deviation, s, of each sample was computed. The
distribution of these thousand values of s is represented by
the dots in Fig. 91.* (The line running through the dots
defines the theoretical distribution of s’s to be expected,
with samples of 4, on the basis of “Student’s” theory.
There is a notably close agreement between the theoretical
and observed distributions.) Traditional sampling concepts
would lead us to expect a normal distribution of s’s, centering
about 1, the value of $\sigma$ in the parent population.
Instead, the distribution is definitely skew, with the measurements
clustering about a central tendency well below
unity. The mode of the thousand values of s here represented
is, in fact, .717 and the arithmetic mean is .801.
These s’s, it will be recalled, represent estimates of $\sigma$.
There is a clear tendency for such estimates, based on
samples of four, to understate the true value.

[Fig. 91. Distribution of Standard Deviations in Samples of Four Drawn
from a Normal Universe. Horizontal axis: standard deviation.]

¹ W. A. Shewhart, Economic Control of Quality of Manufactured Product,
New York, Van Nostrand, 1931, 103-173, 183-186.

* The figure is here reproduced with the permission of Dr. Shewhart and his
publishers.
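An experiment of the kind Shewhart conducted may be imitated by simulation. The sketch below is illustrative only: it assumes that s is computed with divisor N (the choice consistent with the modal value of .7071 cited for samples of four), it uses a pseudo-random generator with an arbitrary seed, and the function names are hypothetical. It does not reproduce Shewhart's actual drawings.

```python
import math
import random
import statistics


def sd_divisor_n(values):
    """Standard deviation computed with divisor N (an assumption; see lead-in)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / n)


def simulate_sample_sds(n_samples=1000, size=4, sigma=1.0, seed=0):
    """Draw n_samples normal samples of the given size; return each sample's s."""
    rng = random.Random(seed)
    return [
        sd_divisor_n([rng.gauss(0.0, sigma) for _ in range(size)])
        for _ in range(n_samples)
    ]


def modal_fraction(size):
    """Theoretical mode of s as a fraction of sigma, for divisor N."""
    return math.sqrt((size - 2) / size)


sds = simulate_sample_sds()
mean_s = statistics.mean(sds)
share_below_sigma = sum(s < 1.0 for s in sds) / len(sds)
print(f"mean of s: {mean_s:.3f}")
print(f"share of samples with s < sigma: {share_below_sigma:.3f}")
print(f"theoretical modal s / sigma for N = 4: {modal_fraction(4):.4f}")
```

Under these assumptions the simulated mean of s should fall near .80, well below unity, and most samples should give an s smaller than the true $\sigma$, in keeping with the tendency to understatement noted above.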

The symbol T has been used above to define the deviation 
of a statistical measure from some standard or hypothetical 
value, expressed in units of the estimated standard error 
of the measure in question, when the deviation, so expressed, 
could be interpreted as a normal deviate. In the present 
exposition we shall employ the symbol t to relate to approxi- 
mations to T when these approximations are based on 
small samples. 

The difference between T and t may be reduced to more
definite terms. If we let $x = \bar{X} - M$, we may write

$$T = \frac{x}{\sigma_{\bar{x}}}, \qquad t = \frac{x}{s_{\bar{x}}}.$$

We may derive t from T:

$$t = \frac{x}{\sigma_{\bar{x}}} \div \frac{s_{\bar{x}}}{\sigma_{\bar{x}}} = T \cdot \frac{\sigma_{\bar{x}}}{s_{\bar{x}}}.$$

The normally distributed quantity, $x/\sigma_{\bar{x}}$, has been divided
by the factor $s_{\bar{x}}/\sigma_{\bar{x}}$, to give the quantity t. Opportunity to
correct for the bias is given us, however, by the fact that
the distribution of $s_{\bar{x}}/\sigma_{\bar{x}}$ is known. Thus the probability
corresponding to any stated value of t may be determined
(when t defines a departure from a certain hypothetical
value, measured in units of $s_{\bar{x}}$).¹

It is of some interest to compare values of t corresponding 
to stated probabilities, for samples of varying sizes, with 
values of T corresponding to the same probabilities. This 
is done in Table 137. 

                              Table 137

      Values of t and T Corresponding to Stated Probabilities*

   n                           Probability
          .80      .50      .40      .20      .10      .05      .01

   1     .325    1.000    1.376    3.078    6.314   12.706   63.657
   2     .289     .816    1.061    1.886    2.920    4.303    9.925
   3     .277     .765     .978    1.638    2.353    3.182    5.841
   4     .271     .741     .941    1.533    2.132    2.776    4.604
   5     .267     .727     .920    1.476    2.015    2.571    4.032
   6     .265     .718     .906    1.440    1.943    2.447    3.707
   7     .263     .711     .896    1.415    1.895    2.365    3.499
   8     .262     .706     .889    1.397    1.860    2.306    3.355
   9     .261     .703     .883    1.383    1.833    2.262    3.250
  10     .260     .700     .879    1.372    1.812    2.228    3.169
  20     .257     .687     .860    1.325    1.725    2.086    2.845
  30     .256     .683     .854    1.310    1.697    2.042    2.750
   ∞   .25335   .67449   .84162  1.28155  1.64485  1.95996  2.57582

The familiar values given in the customary table of
areas under the normal curve appear on the last line of
Table 137, for n = ∞. These are the values of T, as a normal
deviate, corresponding to probabilities of .80, .50, etc.
Thus, when we are dealing with infinitely large samples,
the probability of a given sample yielding a value of T
as great as .25335 or greater (either above or below the
mean) is .80. (The area between the maximum ordinate
and an ordinate erected at + .25335 is 10 per cent of the
total area under the normal curve. Twenty per cent of
the total area will fall within ± .25335, and 80 per cent
will fall beyond these limits.) Similarly, just 50 per cent
of the values of T will exceed the limits ± .67449; 5 per
cent will exceed the limits ± 1.95996; 1 per cent will
exceed the limits ± 2.57582.

¹ The degree of error involved in using s as an approximation to $\sigma$, for small
samples, is indicated by the following figures, taken from W. A. Shewhart
(loc. cit., 185). They define the relation between the modal s, for samples of
size N drawn from a population of which the standard deviation is known, and
the true $\sigma$ of that population.

  Size of sample        Modal s as a decimal fraction of true $\sigma$

The fractions given above define relations that are to be expected on the
basis of error theory, as modified by “Student” to take account of conditions
affecting small samples. The modal value of the 1,000 standard deviations
obtained by Shewhart, in his empirical test of this theory was, as we have seen,
.717 of the standard deviation of the universe. This result is very close indeed
to the expected value of .7071 for samples in which N = 4.

As n grows smaller each of these limits must be extended,
if the probabilities are to remain constant. For samples
in which n is equal to 10, 50 per cent of the values of t will
fall beyond the limits ± .700; 5 per cent will exceed the
limits ± 2.228, and 1 per cent will exceed the limits
± 3.169. (The letter n in Table 137 refers to the number
of degrees of freedom in the computation of t. This general
concept has been discussed in Chapter XV. When the
arithmetic mean of a sample is being tested for significance,
n = N - 1.) If in applying various statistical tests we attach
significance to a given level of probabilities, such as 5/100
or 1/100, we must recognize that the values of t corresponding
to these probabilities vary with n. Fortunately, we
now know how these values vary and, using such a table as
that given above, may make allowance for the variation.

* The entries in this table are extracts from a more detailed table (Table IV)
in R. A. Fisher’s Statistical Methods for Research Workers, Edinburgh, Oliver
and Boyd, sixth edition, 1936. The table is printed here through the courtesy
of Dr. Fisher and his publishers.
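The dependence of these probabilities on n may be checked numerically. The following sketch is not the method by which the printed table was constructed; it assumes the standard density function of t for n degrees of freedom and integrates it by Simpson's rule. The function names and the number of integration steps are illustrative choices.

```python
import math


def t_density(t, n):
    """Density of t with n degrees of freedom (standard form, assumed here)."""
    c = math.gamma((n + 1) / 2) / (math.sqrt(n * math.pi) * math.gamma(n / 2))
    return c * (1 + t * t / n) ** (-(n + 1) / 2)


def two_tailed_p(t0, n, steps=2000):
    """P(|t| >= t0), integrating the density on [0, t0] by Simpson's rule.

    steps must be even; n is kept small so that math.gamma does not overflow.
    """
    t0 = abs(t0)
    h = t0 / steps
    total = t_density(0.0, n) + t_density(t0, n)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, n)
    return 1.0 - 2.0 * total * h / 3.0


def two_tailed_p_normal(t0):
    """The limiting case n = infinity: t0 read as a normal deviate."""
    return math.erfc(abs(t0) / math.sqrt(2))


for n, t0 in [(10, 0.700), (10, 2.228), (10, 3.169), (5, 4.032)]:
    print(n, t0, round(two_tailed_p(t0, n), 4))
print("normal", 1.95996, round(two_tailed_p_normal(1.95996), 4))
```

Entries drawn from Table 137 should return probabilities close to the column headings under which they stand, and the same value of t should correspond to a larger probability as n grows smaller.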

For convenience in exposition we have distinguished T, 
as a normal deviate, from t, a similar deviate relating to a 
distribution of quantities derived from small samples, and 
therefore not normal. The probabilities corresponding to 
a given value of T are not the same as the probabilities 
corresponding to an identical value of t. Indeed, these 
probabilities vary for the same value of t computed from 
samples of different sizes. The distinction between T and 
t need not be preserved, however. We may use t generally 
to define the deviation of a statistical measure from some 
standard or hypothetical value, expressed in units of the 
standard error of the measure in question. The quantity
t is to be interpreted as a normal deviate when large samples
are dealt with. The interpretation is modified in dealing
with small samples, as we have seen. The nature of the
modification required is shown by the entries in Table 137
and in Appendix Table II.

EXAMPLES OF TESTS BASED ON t-TABLE

In determining whether the mean of a sample deviates
significantly from any stated value we may compute t
from the relation

$$t = \frac{\bar{X} - M}{s_{\bar{x}}}$$

where $\bar{X}$ is the mean of the sample, M is the stated value
and $s_{\bar{x}}$ is an approximation to the standard error of $\bar{X}$.
For this approximation we have

$$s_{\bar{x}} = \frac{s}{\sqrt{N}}$$

where s is the standard deviation of the sample (here computed
from $\sqrt{\sum d^2/(N-1)}$). The value t, which for larger samples
we have interpreted with reference to a table of areas
under the normal curve, we here interpret with reference
to the special t-table for small samples. In using the t-table
for this purpose we take n of that table as equal to N - 1.

For the six New England states the average earnings
of factory workers in 1935,¹ as indicated by census returns,
were as follows:

  Maine                $  851
  New Hampshire           892
  Vermont                 940
  Massachusetts         1,007
  Rhode Island            938
  Connecticut           1,016

  Average              $940.67

For s we obtain the figure $63.99. The standard error of
the mean is

$$s_{\bar{x}} = \frac{s}{\sqrt{N}} = \frac{\$63.99}{\sqrt{6}} = \$26.13.$$

Does the average of annual earnings of factory workers
in the six New England states differ significantly from
$1,022, the average for the country as a whole? Computing
t we have

$$t = \frac{\bar{X} - M}{s_{\bar{x}}} = \frac{\$940.67 - \$1,022}{\$26.13} = -3.11.$$

Consulting the t-table with n = 5 we find that for P = .01,
t = 4.032. The observed deviation is not as great as this.
If our standard is a P of .01, the average for the New
England states is not to be judged significantly less than
the average for the country as a whole. If the standard
were a P of .05, however, the deviation would be considered
significant.

¹ These averages, and similar ones cited below, are derived by dividing the
total wages paid by the average number of wage-earners employed during the
year. Part-time workers are included. The averages do not represent full-time
earnings, therefore.
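The arithmetic of this test may be retraced as follows. The sketch keys in the six state averages given above and computes s with divisor N - 1; the function name is illustrative.

```python
import math

# Average earnings, six New England states, 1935 (dollars), as tabulated above.
new_england = [851, 892, 940, 1007, 938, 1016]


def one_sample_t(values, stated_value):
    """Return mean, s (divisor N - 1), standard error of the mean, t, and n."""
    n_obs = len(values)
    mean = sum(values) / n_obs
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n_obs - 1))
    se = s / math.sqrt(n_obs)
    return mean, s, se, (mean - stated_value) / se, n_obs - 1


mean, s, se, t, dof = one_sample_t(new_england, 1022)
print(f"mean = {mean:.2f}, s = {s:.2f}, s_xbar = {se:.2f}, t = {t:.2f}, n = {dof}")
```

Computed in this way s comes to about 64.0 rather than the 63.99 quoted, a difference too small to affect t, which remains -3.11 with n = 5. The value lies between the tabled limits of 2.571 (P = .05) and 4.032 (P = .01).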

Similarly, we may test with reference to the t-table the
significance of a difference between two means, computed
from small samples. In this case we obtain t from the
relation

$$t = \frac{\bar{X}_1 - \bar{X}_2}{s} \sqrt{\frac{N_1 N_2}{N_1 + N_2}}$$

where the $\bar{X}$’s and N’s have the customary meanings, and
s is, in effect, an average standard deviation of the two
distributions. For s we have

$$s = \sqrt{\frac{\sum d_1^2 + \sum d_2^2}{N_1 + N_2 - 2}}.$$

Here $d_1$ and $d_2$ are used, respectively, to denote deviations
of given observations from the means of the two distributions.
The value t, as derived above, corresponds to $t = D/\sigma_D$,
where D is the difference between two means and $\sigma_D$ is
the standard error of that difference. For small samples,
however, the customary formula for $\sigma_D$ is modified somewhat,
and the special t-table rather than the table of normal
deviates is used. In consulting the t-table in a problem of
this type, n is taken as equal to $N_1 + N_2 - 2$.

Average earnings of workers employed in manufacturing
plants in six Southern states, in 1935, are shown below:

  North Carolina       $662
  South Carolina        615
  Georgia               599
  Tennessee             744
  Alabama               640
  Mississippi           541

  Average           $633.50

Does this average differ significantly from the mean earnings
in six New England states in the same year? For the computation
of s we have

$$s = \sqrt{\frac{\sum d_1^2 + \sum d_2^2}{6 + 6 - 2}} = \$66.06$$

and for t

$$t = \frac{\$940.67 - \$633.50}{\$66.06} \sqrt{\frac{36}{12}} = 8.05.$$

In the t-table, for n = 10, we find that the value of t corresponding
to a P of .01 is 3.169. The present value is clearly
significant. The two samples could not have come from
one homogeneous parent population.
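The comparison of the two regional means may be retraced in the same manner. The sketch below pools the squared deviations of both samples, as in the formula for s given above; the state figures are those tabulated in the text, and the function names are illustrative.

```python
import math

new_england = [851, 892, 940, 1007, 938, 1016]  # dollars, 1935
southern = [662, 615, 599, 744, 640, 541]       # dollars, 1935


def mean_of(values):
    return sum(values) / len(values)


def sum_sq_dev(values):
    """Sum of squared deviations from the sample mean."""
    m = mean_of(values)
    return sum((v - m) ** 2 for v in values)


def two_sample_t(a, b):
    """Pooled s, t, and degrees of freedom for the difference of two means."""
    n1, n2 = len(a), len(b)
    dof = n1 + n2 - 2
    s = math.sqrt((sum_sq_dev(a) + sum_sq_dev(b)) / dof)
    t = (mean_of(a) - mean_of(b)) / s * math.sqrt(n1 * n2 / (n1 + n2))
    return s, t, dof


s, t, dof = two_sample_t(new_england, southern)
print(f"s = {s:.2f}, t = {t:.2f}, n = {dof}")
```

The result agrees with the figures of the text: s is about $66.06 and t about 8.05, with n = 10.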

The t-table has particular value in connection with the
interpretation of coefficients of regression. We may have
observed that a given variable, Y, appears to increase by
a constant increment or at a constant rate as another
variable, X, changes in value. The degree of relationship
between the two variables may be measured in terms of r,
the coefficient of correlation, but special interest often
attaches to the functional relationship and, in particular,
to the apparent regression of Y on X. Does b of the equation
of regression

$$Y = a + bX$$

depart significantly from zero, or from some other value
which has significance for the purpose in mind? Here we
must judge b with reference to the sampling errors to which
it is exposed.

A general test of this type was applied in an earlier
section (Chapter XIV), in seeking to determine whether
average corn yield in Kansas had shown a significant decline
over the period 1890-1933. For smaller samples we may
compute t by exactly the methods there presented, but we
should interpret t with reference to the special t-table
adapted to small samples. As a general formula we have

$$t = \frac{b - \beta}{s_b}$$

where b is a coefficient of regression and $\beta$ is a norm with
reference to which we wish to judge the given value of b.
For the standard error of b we have

$$s_b = \frac{s_y}{\sqrt{\sum x^2}}$$

where

$$s_y = \sqrt{\frac{\sum (Y - Y_c)^2}{N - 2}}.$$

(In these expressions, $x = X - \bar{X}$, Y is an observed value
of the dependent variable, and $Y_c$ is the corresponding
computed value.) In interpreting the value of t thus secured,
the t-table is employed with n = N - 2.

This test may be extended to the comparison of two
coefficients of regression. The series in Table 138 provide
an illustration.

                              Table 138

  Aggregate Values of Loans on Securities and Commercial Loans,
            Reporting Member Banks, Federal
               Reserve System, 1922-1929
          (In hundreds of millions of dollars)

  Year      Loans on          Commercial loans
            securities        (“all other loans”)

  1922         39                   73
  1923         41                   78
  1924         45                   80
  1925         53                   82
  1926         57                   86
  1927         62                   87
  1928         69                   89
  1929         77                   92

For loans on securities the trend (i.e., the equation of
regression of volume of loans on time) is defined by

$$Y_1 = 30.63 + 5.49X_1.$$

The corresponding equation for commercial loans is

$$Y_2 = 72.13 + 2.54X_2.$$

In each case the origin is at 1921. The eight-year period
was marked by an increase of loans on securities which was
much more rapid than the corresponding advance in commercial
loans. We must ask, however, whether the difference
between the two coefficients of regression is really significant,
if account be taken of sampling fluctuations.

The coefficients to be compared are

$$b_1 = 5.49, \qquad b_2 = 2.54.$$

In testing whether $b_1 - b_2$ is significant (i.e., deviates
significantly from zero) we must compute

$$t = \frac{b_1 - b_2}{\sigma_{b_1 - b_2}},$$

$\sigma_{b_1 - b_2}$ being, of course, the standard error of the difference
between the two coefficients of regression. For this standard
error we have

$$\sigma_{b_1 - b_2} = \sqrt{s_y^2 \left( \frac{1}{\sum x_1^2} + \frac{1}{\sum x_2^2} \right)}$$

where $x_1$ and $x_2$ are given values of the two variables,
expressed as deviations from their respective arithmetic
means, and

$$s_y^2 = \frac{\sum (Y_1 - Y_{c1})^2 + \sum (Y_2 - Y_{c2})^2}{N_1 + N_2 - 4}.$$

$s_y$ is a measure of the average scatter about the two lines
of regression.

In the present example we have $s_y^2 = 2.40$ and, with time
measured in years from 1921, $\sum x_1^2 = \sum x_2^2 = 42$, so that

$$\sigma_{b_1 - b_2} = \sqrt{2.40 \left( \frac{1}{42} + \frac{1}{42} \right)} = .338, \qquad t = \frac{5.49 - 2.54}{.338} = 8.73.$$

For the interpretation of this value of t we enter the t-table
with $n = N_1 + N_2 - 4 = 12$. In this case the value of t
far exceeds the value of 3.055, corresponding to P = .01.
The results are not consistent with the hypothesis that
the true value of $b_1 - b_2$ is zero. The trends of the two
series differ significantly. (Here, again, the reader should
bear in mind that such tests of significance apply only
with important qualifications to economic series that are
ordered in time.)
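The comparison of the two trends may be sketched in code. The sketch rests on two assumptions that should be kept in view: the standard error of the difference is taken as the pooled scatter about the two lines multiplied by the square root of the sum of the reciprocals of the two sums of squared deviations of X, and the 1924 and 1928 entries for loans on securities are read as 45 and 69. The function names are hypothetical.

```python
import math

years = list(range(1, 9))                       # 1922-1929, origin at 1921
securities = [39, 41, 45, 53, 57, 62, 69, 77]   # 1924 and 1928 as read here
commercial = [73, 78, 80, 82, 86, 87, 89, 92]


def fit_line(x, y):
    """Least-squares line: intercept, slope, sum of x^2 (deviations), residual SS."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return a, b, sxx, resid


def compare_slopes(x1, y1, x2, y2):
    """t for the difference between two regression coefficients, n = N1 + N2 - 4."""
    _, b1, sxx1, resid1 = fit_line(x1, y1)
    _, b2, sxx2, resid2 = fit_line(x2, y2)
    dof = len(x1) + len(x2) - 4
    s2 = (resid1 + resid2) / dof               # pooled scatter about both lines
    se = math.sqrt(s2 * (1 / sxx1 + 1 / sxx2))
    return b1, b2, s2, (b1 - b2) / se, dof


b1, b2, s2, t, dof = compare_slopes(years, securities, years, commercial)
print(f"b1 = {b1:.2f}, b2 = {b2:.2f}, s_y^2 = {s2:.2f}, t = {t:.2f}, n = {dof}")
```

On these readings the slopes come to 5.49 and 2.54 and the pooled variance to about 2.39. The intercepts and t so computed need not agree to the last figure with printed values, which may rest on unrounded data, but t stands far above the limit of 3.055 for n = 12.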

Sampling Errors of Coefficients of Correlation
Computed from Small Samples

As a general formula for the determination of the standard
error of the coefficient of correlation we have made use of

$$\sigma_r = \frac{1 - r^2}{\sqrt{N}}.$$

In error theory, the r that appears in the numerator of
the right-hand member of this equation is the coefficient
of correlation in the universe from which the sample in
question is drawn. But this r is not known. Our best
approximation to it is the r derived from the sample. Here,
again, we face distortion in small samples, a distortion
that is the greater the higher the value of the true correlation.
The nature of this bias may be readily understood.
If we are drawing samples from a universe in which the
true value of r is + .95, the range of the possible variation
of the sample r’s above the true r is only .05. But the
range of possible variation below the true value is 1.95
(i.e., from + .95 to - 1.00). Accordingly, a distribution
of r’s obtained from a great many small samples from this
universe will be sharply skew. An estimate of the true
value based upon a sample value will be subject to corresponding
bias. This bias will not be present when the
population value of r is zero. (The distribution of sample
r’s when the population value of r is zero will be symmetrical,
but will depart somewhat from the normal type in other
respects.) It will not be pronounced when the samples
are large, even for high values of r. But when samples
are small and the population value of r departs materially
from zero, substantial inaccuracy results from the use of
the formula given above.

Allowance may be made for this bias by use of the table
showing the distribution of t, for samples of various sizes.
R. A. Fisher has shown that the procedure employed in
deriving t, in testing whether a coefficient of linear regression
differs significantly from zero, may be used, with an algebraic
modification of the mathematical expression, in determining
the significance of r. If we are testing the hypothesis that
a sample from which a given r has been computed was drawn
from a population in which the true value of r is zero, we
may compute t from the relation

$$t = \frac{r \sqrt{N - 2}}{\sqrt{1 - r^2}}.$$

This is equivalent, of course, to dividing the quantity r - 0
(i.e., the deviation of the given r from the hypothetical
value of zero) by $\sqrt{1 - r^2} / \sqrt{N - 2}$. In consulting the
t-table for the interpretation of the values thus obtained, n,
the number of degrees of freedom, is taken as equal to N - 2.

As an illustration, we may test the results obtained from
a study of the relation between the production and the
price of cotton in the United States, covering 35 observations.
The value of r is - .65. We have

$$t = \frac{-.65 \sqrt{35 - 2}}{\sqrt{1 - (-.65)^2}} = -4.91.$$

In consulting the t-table we find that for n = 33 the value
of t corresponding to a probability of 1 per cent¹ is approximately
2.73. If the true value of t were zero, a value as
great as 2.73 or greater would occur only 1 time out of
100, as a result of chance fluctuations of sampling. The
present value of t is substantially greater than 2.73. It
is highly improbable that it reflects a chance drawing from
a population in which the true value of t (and, of course,
of r) is zero. The results we have obtained are not, then,
consistent with the hypothesis that the true value of r is
zero. There appears to be a significant negative correlation
between the production and the price of cotton.

¹ This probability refers to the likelihood of deviations above or below the
assumed true value of zero. It corresponds to the sum of areas at both extremities
of a frequency curve. We may divide it by two to obtain the probability
of a deviation of the stated magnitude in one direction only from the hypothetical
value. In most problems of the type here discussed it is conservative practice
to test given results with reference to the probability of a deviation of
given magnitude, without consideration of the direction of deviation. The
tabulated values of t lend themselves to this procedure.

                              Table 139

   Values of the Correlation Coefficient for Different Levels of
                           Significance*

    n       P = .05        P = .02        P = .01

    1       .996917        .9995066       .9998766
    2       .95000         .98000         .990000
    3       .8783          .93433         .95873
    4       .8114          .8822          .91720
    5       .7545          .8329          .8745
    6       .7067          .7887          .8343
    7       .6664          .7498          .7977
    8       .6319          .7155          .7646
    9       .6021          .6851          .7348
   10       .5760          .6581          .7079
   11       .5529          .6339          .6835
   12       .5324          .6120          .6614
   13       .5139          .5923          .6411
   14       .4973          .5742          .6226
   15       .4821          .5577          .6055
   16       .4683          .5425          .5897
   17       .4555          .5285          .5751
   18       .4438          .5155          .5614
   19       .4329          .5034          .5487
   20       .4227          .4921          .5368
   25       .3809          .4451          .4869
   30       .3494          .4093          .4487
   35       .3246          .3810          .4182
   40       .3044          .3578          .3932
   45       .2875          .3384          .3721
   50       .2732          .3218          .3541
   60       .2500          .2948          .3248
   70       .2319          .2737          .3017
   80       .2172          .2565          .2830
   90       .2050          .2422          .2673
  100       .1946          .2301          .2540

* This table is printed here through the courtesy of R. A. Fisher and his
publishers, Oliver and Boyd, of Edinburgh. The original appears as Table V. A
of Statistical Methods for Research Workers.
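The computation of t for the cotton example may be written out as a short function. This is merely the relation between t and r stated earlier in this section, with N taken as the number of pairs of observations; the function name is illustrative.

```python
import math


def t_from_r(r, n_pairs):
    """t for testing the hypothesis that the true correlation is zero; n = N - 2."""
    return r * math.sqrt(n_pairs - 2) / math.sqrt(1 - r * r)


t_cotton = t_from_r(-0.65, 35)
print(f"t = {t_cotton:.2f} with n = {35 - 2}")  # compare with about 2.73 at P = .01
```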

If we are seeking to determine the significance of given
coefficients of correlation with reference to hypothetical
values of zero, use may be made of a table prepared by
R. A. Fisher, showing the values of correlation coefficients
at stated levels of significance. Selected values from this
table are given in Table 139 and in Appendix Table III. In
simple correlation problems, this is to be read with n equal
to N - 2 (the number of pairs of original observations
less 2). In determining the significance of coefficients of
partial correlation the number of variables held constant
is also subtracted from N.

The use of the table requires little explanation. If a
sample is based on 12 pairs of observations, with n equal
to 10, we would require a coefficient at least as high as
.7079 before we accept it as significant, if our standard
of significance is P = .01. For only 1 time out of 100
trials would a sample of 12 drawn from an uncorrelated
population yield a value of r as great as .7079. If our
standard of significance is P = .05 we would accept as
significant of a real relationship an r of .5760, or greater,
obtained from a sample of 12.
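The entries of such a table are connected with those of the t-table. Solving the relation $t = r\sqrt{n}/\sqrt{1 - r^2}$ for r gives $r = t/\sqrt{t^2 + n}$, so that a limiting value of t for n degrees of freedom may be converted into the corresponding limiting value of r. The sketch below illustrates this algebraic step only; it is not offered as the procedure by which the printed table was prepared, and the function name is hypothetical.

```python
import math


def critical_r(critical_t, n):
    """Value of r corresponding to a given value of t, with n = N - 2."""
    return critical_t / math.sqrt(critical_t ** 2 + n)


# Limits of t for n = 10, taken from Table 137.
r_05 = critical_r(2.228, 10)
r_01 = critical_r(3.169, 10)
print(f"n = 10: r at P = .05 is {r_05:.4f}; r at P = .01 is {r_01:.4f}")
```

For n = 10 the two values agree, to four places, with the entries .5760 and .7079 used in the illustration above.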

TRANSFORMATION OF r TO z

The sampling limitations attaching to r have led R. A. Fisher
to utilize as a general measure of linear correlation a logarithmic
function of r that possesses certain distinctive merits.*
In effecting the transformation we have

$$z = \tfrac{1}{2} \left[ \log_e (1 + r) - \log_e (1 - r) \right].$$

Conversely

$$r = \frac{e^{2z} - 1}{e^{2z} + 1}.$$

The scales of possible values of r and z are, of course, quite
different. For r = 0, z = 0, and for r = 1, z = ∞. Negative
values of r give negative values of z. The relations
between the two functions, at different levels of correlation,
are shown by the entries in Appendix Table IV. Transformation
may be more readily effected by means of this
table than from the relations given above.

There are certain highly important advantages in this 
transformation. Not least is the replacing of r by a function 
with a distribution of values corresponding more closely 
to the true significance of observed correlations than do 
those of r. Thus a change in the value of r from .88 to 
.98 is equivalent, on the r scale, to a change from .20 to 
.30. But the first of these differences represents, on the 
z scale, a change from 1.38 to 2.30 (a range of .92) while
the second represents a change in z from .20 to .31 (a 
range of .11). The first difference, on the z scale, is over 
8 times more significant than the second. In this the z 
scale gives a far more accurate representation of the true 
significance of observed correlations than does the r scale. 
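The transformation and its inverse are easily computed directly. The sketch below applies the logarithmic relation between r and z to the four values of r just discussed; the function names are illustrative.

```python
import math


def r_to_z(r):
    """z = 1/2 [log_e(1 + r) - log_e(1 - r)]."""
    return 0.5 * (math.log(1 + r) - math.log(1 - r))


def z_to_r(z):
    """Inverse transformation: r = (e^(2z) - 1) / (e^(2z) + 1)."""
    return (math.exp(2 * z) - 1) / (math.exp(2 * z) + 1)


high_range = r_to_z(0.98) - r_to_z(0.88)
low_range = r_to_z(0.30) - r_to_z(0.20)
print(f"z(.88) = {r_to_z(0.88):.2f}, z(.98) = {r_to_z(0.98):.2f}, range = {high_range:.2f}")
print(f"z(.20) = {r_to_z(0.20):.2f}, z(.30) = {r_to_z(0.30):.2f}, range = {low_range:.2f}")
print(f"ratio of ranges = {high_range / low_range:.1f}")
```

The first range comes to about .92 and the second to about .11, a ratio of something over 8.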

More important than this, however, is the fact that the
distribution of z is much closer to the normal type than is
that of r; in particular, the distribution of z is not subject,
as is that of r, to marked variations in form with variations
in the degree of correlation in the population. The form
of the distribution of z is virtually independent of the
degree of correlation. As a result, the sampling errors to
which z is exposed may be estimated with considerable
accuracy. For the standard error of z we have

$$\sigma_z = \frac{1}{\sqrt{N - 3}}.$$

This standard error, it is to be noted, is a function solely
of N. It is independent of the true value of z in the parent
population.

* See Statistical Methods for Research Workers, Chapter VI.

From the example in Chapter XVI we obtained a coefficient
of partial correlation of - .2923 between corn yield
per acre in Kansas and average June temperature, holding
constant effects of changes in July and August temperatures.
Referring to Appendix Table IV we have, for
r = - .2923, z = - .301. In computing the standard
error of a coefficient of partial correlation we must subtract
from N the number of variables held constant. Since N
equals 44 in the example in question, we treat the coefficient
of partial correlation as we would a simple coefficient based
on 42 observations. For the standard error of z we have,
then,

$$\sigma_z = \frac{1}{\sqrt{42 - 3}} = .160.$$

With reference to this result we may determine whether z
differs significantly from zero. For the test we have

$$t = \frac{z - 0}{\sigma_z} = \frac{-.301}{.160} = -1.88.$$

We interpret 1.88 as a normal deviate. It is clear that it
is not large enough to indicate that z is significant. The
result is not inconsistent with the hypothesis that the true
value of z (and hence of r) is zero.

If, however, we test the coefficient $r_{14.23} = -.4057$, from
the same example (defining the relation between corn yield
per acre and August temperature, with June and July temperatures
held constant), we have

$$t = \frac{z - 0}{\sigma_z} = \frac{-.430}{.160} = -2.69.$$

This result is clearly significant. So, also, is the measure
$r_{13.24} = -.6101$, the coefficient of partial correlation between
corn yield and July temperature, with June and
August temperatures held constant.

The procedure would be similar, of course, if we were
testing the significance of the deviation of an observed
value of z from a theoretical value other than zero.
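The three tests on the Kansas coefficients may be run through a single function. The sketch below obtains z from the logarithmic relation directly rather than from Appendix Table IV, reduces N by the number of variables held constant, and treats the resulting ratio as a normal deviate; the function and parameter names are hypothetical.

```python
import math


def z_test(r, n_obs, held_constant=0, hypothetical_r=0.0):
    """Deviation of z from its hypothetical value, in units of its standard error."""
    effective_n = n_obs - held_constant        # partial correlation adjustment
    sigma_z = 1 / math.sqrt(effective_n - 3)
    deviate = (math.atanh(r) - math.atanh(hypothetical_r)) / sigma_z
    return deviate, sigma_z


for label, r in [("June", -0.2923), ("August", -0.4057), ("July", -0.6101)]:
    deviate, sigma_z = z_test(r, n_obs=44, held_constant=2)
    print(f"{label}: sigma_z = {sigma_z:.3f}, deviate = {deviate:.2f}")
```

The June coefficient gives a deviate of about -1.88, short of the limit of 1.96 for P = .05, while the August and July coefficients give deviates beyond the limit of 2.58 for P = .01.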



The transformation to z makes possible, also, an accurate
test of the significance of the difference between two
observed correlations. The standard error of the difference
between two values of z is given by

$$\sigma_{z_1 - z_2} = \sqrt{\frac{1}{N_1 - 3} + \frac{1}{N_2 - 3}}$$

where $N_1$ is the number of pairs of observations in the first
sample, $N_2$ the number in the second.

This test may be illustrated with reference to observations
on the timing of price changes during business cycles.
For 111 commodities we have observations on the timing
of price declines in two successive periods of business
recession occurring in the late 90’s and early 1900’s. The
degree of relation between the time sequences of commodity
price changes in these two recessions is indicated by a
coefficient of correlation of + .22. For two similar (successive)
periods in the 1920’s the measure of correlation,
based on the prices of 121 commodities, has a value of
+ .36. There appears to have been a closer approach to a
common pattern in the later period than in the earlier.
In testing the significance of the difference between the two
results we set up the hypothesis that the two samples were
drawn from the same parent population, and that therefore
the true value of the difference between the two coefficients
is zero.

For the two samples we have

$$r = .22; \quad z = .223; \quad \sigma_z^2 = \frac{1}{108} = .0093$$

$$r = .36; \quad z = .377; \quad \sigma_z^2 = \frac{1}{118} = .0085.$$

The difference to be tested is

$$D_z = .377 - .223 = .154.$$

The standard error of this difference is

$$\sigma_{D_z} = \sqrt{.0093 + .0085} = .133.$$

We wish to know whether $D_z$ is significantly different from
zero. We compute, therefore,

$$t = \frac{D_z - 0}{\sigma_{D_z}} = \frac{.154 - 0}{.133} = 1.16.$$

Interpreting 1.16 as a normal deviate, we conclude that
the difference is not significant. $D_z$ differs from the hypothetical
value of zero by only slightly more than one standard
deviation. The results are not inconsistent with the
hypothesis that the two samples are drawings from the
same parent population. There is here no clear evidence
that the degree of relationship between price movements
in successive cycles was closer in the 1920’s than in the
earlier period.¹
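The test of the difference between the two correlations may be sketched as follows, using the coefficients and sample sizes cited above. The values of z are computed directly from the logarithmic relation rather than read from a table; the function name is illustrative.

```python
import math


def compare_correlations(r1, n1, r2, n2):
    """Difference between two z's, its standard error, and their ratio."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    d_z = z2 - z1
    sigma_d = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return d_z, sigma_d, d_z / sigma_d


d_z, sigma_d, deviate = compare_correlations(0.22, 111, 0.36, 121)
print(f"D_z = {d_z:.3f}, standard error = {sigma_d:.3f}, deviate = {deviate:.2f}")
```

Carried through without intermediate rounding the ratio is about 1.15; the figure of 1.16 in the text results from rounding $D_z$ and its standard error to three places before dividing. The conclusion is unaffected.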

¹ The time factor enters to cloud statistical inductions relating to samples
drawn from different periods (see above, Chapter XIV). Such an induction
should be supported by evidence indicating that fundamental conditions in the
field in question have not been altered over the time interval involved. This
caution does not, of course, affect the procedure illustrated above.

Finally, making use of the z-transformation, we may
combine results secured from the measurement of correlation
in different samples. If we have two values of r,
obtained from samples drawn from the same population,
a weighted average of the two will provide a better estimate
of the true correlation than will either of the r’s, taken
separately. For the averaging process we transform the
r’s to z’s, weight each z by the corresponding N, less 3,
and average them. Then, if desirable, the corresponding
value of r may be determined. We may note that the
standard error of the weighted average of the two z’s is
given by

$$\sigma_{\bar{z}} = \frac{1}{\sqrt{(N_1 - 3) + (N_2 - 3)}}.$$
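The averaging process may be sketched in a few lines. For illustration only, the sketch combines the two coefficients of the preceding example (+ .22 from 111 commodities and + .36 from 121), on the supposition that both samples come from one population; the function name is hypothetical.

```python
import math


def combine_correlations(samples):
    """Weighted average of z's, each weighted by N - 3.

    samples is a list of (r, N) pairs; returns (average r, average z, standard error).
    """
    weights = [n - 3 for _, n in samples]
    z_bar = sum(w * math.atanh(r) for w, (r, _) in zip(weights, samples)) / sum(weights)
    sigma = 1 / math.sqrt(sum(weights))
    return math.tanh(z_bar), z_bar, sigma


r_bar, z_bar, sigma = combine_correlations([(0.22, 111), (0.36, 121)])
print(f"average z = {z_bar:.3f}, corresponding r = {r_bar:.3f}, standard error = {sigma:.4f}")
```

The averaged r necessarily lies between the two observed values, and the standard error of the averaged z is smaller than that of either z taken separately.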


The Chi-Square Test

One of the great contributions of Karl Pearson to statisti- 
cal methodology was the determination of the form of 
the distribution of Chi-square, and the development of 
methods of utilizing this distribution. The character of 
this distribution and various tests based on it are our 
concern in the present section. 


THE NATURE OF CHI-SQUARE AND ITS DISTRIBUTION 

The quantity Chi-square (represented always by the
symbol $\chi^2$) is a measure of the degree to which a series of
observed frequencies deviate from corresponding theoretical
or hypothetical frequencies. The theoretical frequencies
are set up on the basis of some hypothesis, some rational
argument. The magnitude of the discrepancy between
theory and observation is defined by the quantity $\chi^2$. It
was Pearson’s contribution to determine the nature of the
distribution of the values of $\chi^2$ that would be obtained
under given sampling conditions. Knowledge of this distribution
enables us to determine whether a given discrepancy
between theory and observation may be attributed
to chance, or whether it results from the inadequacy of
the theory to fit the observed facts. This instrument is
obviously one of extreme importance in statistical analysis.

The character of the distribution of $\chi^2$ may be discussed
with reference to Weldon’s data relating to the results
obtained in 4,096 throws of 12 dice (see page 433). We
call a 4, 5, or 6 spot a success, a 1, 2, or 3 spot a failure.
When 12 dice are thrown the expected (or theoretical)
number of successes on each throw is 6. A deviation from
6 represents a discrepancy between expectation and observation.
From the result of each throw of 12 dice a value of $\chi^2$
may be computed. Thus, a given throw yields 2 successes
and 10 failures. The 2 successes represent a deviation of 4
from the expected value of 6; the 10 failures represent a
deviation of 4 from the expected value of 6. (In such an
experiment as this there are two components of each value
of $\chi^2$, even though when one component is given the other
is necessarily determined. For the sum of successes and
failures must be 12 on each throw.) The value of $\chi^2$ in
a given instance is obtained by squaring the discrepancies
between expectation and observation, dividing the squared
values by the corresponding expected values, and adding
the quantities thus obtained. That is

$$\chi^2 = \sum \frac{(f_0 - f)^2}{f}$$

where $f_0$ denotes an observed frequency and f defines the
corresponding theoretical frequency.

In the case cited above we have

$$\chi^2 = \frac{(2 - 6)^2}{6} + \frac{(10 - 6)^2}{6} = 5.333.$$

On another trial, with 7 successes and 5 failures, we have

$$\chi^2 = \frac{(7 - 6)^2}{6} + \frac{(5 - 6)^2}{6} = .333.$$

On still another trial, giving 6 successes and 6 failures, we
have

$$\chi^2 = \frac{(6 - 6)^2}{6} + \frac{(6 - 6)^2}{6} = 0.$$

The 4,096 throws thus yield 4,096 values of $\chi^2$. Tabulating
these with respect to the frequency of occurrence of stated
values, we obtain the distribution given in Table 140 on
page 620.
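The computation of $\chi^2$ for a single throw may be expressed as a short function. The sketch below applies the definition just given to the three throws cited; the function name is illustrative.

```python
def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all classes."""
    return sum((fo - f) ** 2 / f for fo, f in zip(observed, expected))


expected = [6, 6]  # expected successes and failures in a throw of 12 dice
for throw in ([2, 10], [7, 5], [6, 6]):
    print(throw, round(chi_square(throw, expected), 3))
```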

Table 140

Tabulation of 4,096 Observed Values of $\chi^2$

Value* of $\chi^2$
(measuring deviation of
observation from expectancy     Frequency of     Frequency of
in dice-throwing                 occurrence       occurrence
experiment)                      (absolute)       (relative)

0 to .833                          2,526            .6167
.833 to 2.167                        966            .2358
2.167 to 4.167                       455            .1111
4.167 to 6.667                       131            .0320
Over 6.667                            18            .0044

Total                              4,096           1.0000

This table gives us information as to the nature of the
discrepancies between theoretical norms and actual results
that chance may bring about. For deviations from the
expected frequency of successes, 6, may be attributed to
the mass of undifferentiated causes we call chance. The
magnitude of $\chi^2$ varies, of course, with the degree of
deviation. Values of $\chi^2$ not exceeding .833 are most frequent.
Higher values of $\chi^2$ occur with decreasing frequency. Only
18 out of 4,096 observed values of $\chi^2$ exceed 6.667. This
distribution furnishes us, therefore, with a standard of
reference to employ when seeking to determine whether
a given discrepancy between theoretical and observed values
is attributable to chance, or whether it is too great to be
so explained.
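For comparison with the observed tabulation, the counts that pure chance would be expected to give can be worked out exactly. The sketch below assumes that each die counts as a success with probability 1/2 (which is what an expectation of 6 successes in 12 implies) and sorts the possible values of $\chi^2$ into the same classes; the names used are illustrative only.

```python
from math import comb

DICE = 12             # dice per throw
EXPECTED = DICE / 2   # 6 expected successes, i.e. P(success) = 1/2 (assumed)

# Upper class limits of the tabulation; the last class is open-ended.
UPPER_LIMITS = [0.833, 2.167, 4.167, 6.667, float("inf")]


def chi_square_for(successes):
    """Chi-square of one throw: successes and failures each compared with 6."""
    failures = DICE - successes
    return ((successes - EXPECTED) ** 2 + (failures - EXPECTED) ** 2) / EXPECTED


def theoretical_class_counts():
    """Expected number of throws, out of 2**12 = 4,096, in each class."""
    counts = [0] * len(UPPER_LIMITS)
    for successes in range(DICE + 1):
        value = chi_square_for(successes)
        ways = comb(DICE, successes)  # equally likely outcomes with this count
        for i, limit in enumerate(UPPER_LIMITS):
            if value <= limit:
                counts[i] += ways
                break
    return counts


observed = [2526, 966, 455, 131, 18]  # absolute frequencies from the table
for obs, exp in zip(observed, theoretical_class_counts()):
    print(obs, exp)
```

Since 4,096 equals 2 to the 12th power, the binomial counts are whole numbers. They are exact for the discrete experiment and so differ somewhat from frequencies read off a continuous $\chi^2$ curve.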

This use of the table, as a standard for determining
the probability that given discrepancies between theory
and observation are attributable to the play of chance,
is facilitated by a somewhat different arrangement. We
may set up a table of cumulative values, based upon the
tabulation of the 4,096 values of $\chi^2$ obtained in the preceding
experiment. These are given in Table 141.

* The 4,096 values of $\chi^2$ tabulated here constitute a discrete series. The
conditions of the experiment are such that the 4,096 observations on $\chi^2$ are
distributed among only six values, ranging from 0 to 8.333. In order that the
observed frequencies of occurrence of stated values of $\chi^2$ may be compared (in
a later table) with theoretical frequencies, an uneven class-interval is
employed above. Class limits are taken midway between successive values at
which the actual observations fall. (The decimal fractions used in the table do
not define these limits with full accuracy.)

The entries in col. (2) of this table indicate that in the
experiment involving 4,096 throws of dice, a value of $\chi^2$
of 6.667 or more occurs less frequently than 1 time out
of 100 (only 44 times out of 10,000, in fact). A value as
great as 4.167, however, occurred more frequently than
3 times out of 100. If we interpret these relative frequencies
as probabilities, we may obtain from such a table a
knowledge of the probabilities corresponding to stated values
of $\chi^2$. Here is the instrument we desire, in seeking to
determine whether given observations conform closely enough
with expectations based on theory, or on working hypotheses
which perhaps are not yet ready to be dignified as theories.


Table 141

Cumulative Relative Frequencies of Occurrence of 4,096 Observed
Values of $\chi^2$ with Corresponding Theoretical Frequencies ¹

        (1)                     (2)                   (3)
  Value of $\chi^2$
(cumulative deviation    Relative frequency    Relative frequency
   of observation           of occurrence         of occurrence
  from expectancy)           (observed)           (theoretical)

0 or more                     1.0000                1.0000
.833 or more                   .3833                 .3613
2.167 or more                  .1475                 .1411
4.167 or more                  .0364                 .0412
6.667 or more                  .0044                 .0098


¹ One degree of freedom is present in the determination of a single value of
$\chi^2$ in this example.

We should note two important limitations attaching to
the entries in col. (2) of the above table, showing relative
frequencies corresponding to stated values of $\chi^2$. In the
first place, these are merely empirical results, obtained
from a given set of experiments. The conditions of the
experiment yield a discontinuous series of values for $\chi^2$.
In some degree, this discontinuity has been ironed out by
the method of classification employed, but the instrument
derived from this single experiment remains an imperfect
one. The effects of chance fluctuations are present in
these results, also, and contribute to the imperfection of
the instrument. The true distribution of $\chi^2$ is only
approximated by the results presented in col. (2) of Table 141.

The entries in col. (3) of Table 141 are free of this
limitation. These record the frequencies with which values
of $\chi^2$ falling within the limits indicated in col. (1) might
be expected to occur, on the basis of mathematical theory,
under the conditions of the present experiment.* These are
the entries which provide the standard we desire, in
determining the significance of a given series of discrepancies
between observation and expectation. It is to be noted,
however, that the empirically derived table constitutes a
fair approximation to the theoretical distribution of $\chi^2$
under these conditions.
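The theoretical entries for one degree of freedom can be reproduced closely from the normal curve. The sketch below rests on a standard result not spelled out in the text: with one degree of freedom $\chi^2$ is the square of a normal deviate, so the probability of equalling or exceeding a given value is `erfc(sqrt(value / 2))`. The function name is illustrative.

```python
from math import erfc, sqrt


def tail_probability_one_df(chi_square):
    """P(chi-square >= given value) for one degree of freedom.

    With one degree of freedom chi-square is the square of a standard
    normal deviate, so the tail area is erfc(sqrt(chi_square / 2)).
    """
    return erfc(sqrt(chi_square / 2.0))


# Class limits from col. (1) and theoretical values from col. (3) of Table 141.
limits = [0.0, 0.833, 2.167, 4.167, 6.667]
printed = [1.0000, 0.3613, 0.1411, 0.0412, 0.0098]

for limit, p in zip(limits, printed):
    computed = tail_probability_one_df(limit)
    print(f"{limit:6.3f}  computed {computed:.4f}  printed {p:.4f}")
```

Small differences in the fourth decimal are to be expected, since the class limits themselves are rounded.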

The second limitation attaching to the example cited
above is that each of the 4,096 values of $\chi^2$ tabulated has
two components, and that the experiment is such that
when one component is given the second is necessarily
determined. (Since there are 12 events in each throw we
know, for example, that if we have 8 successes there must
be 4 failures.) This condition is described by saying that
there is but one degree of freedom in the derivation of
a given value of $\chi^2$. The table we have obtained relates,
therefore, to a special case: the distribution of values
of $\chi^2$ computed with one degree of freedom. There are
other possible cases. For each of these the distribution of $\chi^2$
may be determined in a manner similar to that shown above.

* These relative frequencies are taken from G. Udny Yule, “Table of the
values of P for divergence from independence in the fourfold table,” Journal
of the Royal Statistical Society, Vol. LXXXV, January, 1922, 103-104.

As an example of a different set of conditions we may
consider the outcome of a throw of 24 dice, account being
kept of the frequency of occurrence of each possible result
(i.e., the appearance of a 1, 2, 3, 4, 5, or 6 spot). When 24
dice are thrown there may be expected 4 one spots, 4
two spots, 4 three spots, etc. In a given throw we obtain
the following results:


Number of spots         1    2    3    4    5    6

Observed frequency      2    5    6    4    4    3
Expected frequency      4    4    4    4    4    4

For the results of this throw the value of Chi-square would
be given by

$$\chi^2 = \frac{(2-4)^2}{4} + \frac{(5-4)^2}{4} + \frac{(6-4)^2}{4} + \frac{(4-4)^2}{4} + \frac{(4-4)^2}{4} + \frac{(3-4)^2}{4} = 2.50.$$


This quantity has six components. However, as soon as
five are given the sixth is determined, since the total number
of events is fixed at twenty-four. There are, then, five
degrees of freedom in the calculation of $\chi^2$ in this experiment.

If the 24 dice were thrown a thousand times, say, we
should have one thousand values of $\chi^2$. A distribution
of these could be constructed, similar to that derived
empirically for the case in which there was one degree
of freedom. It would be a different distribution, however,
for the change in degrees of freedom has an obvious relation
to the magnitude of $\chi^2$. The character of the distribution
of the values of $\chi^2$ that would be obtained in such an
experiment is indicated by the entries in Table 142.
We do not here give empirical values, as in the preceding
example. The table shows the theoretical frequencies with
which given values of $\chi^2$ occur, when five degrees of freedom
prevail.

Table 142

Tabulation of $\chi^2$ Computed with Five Degrees of Freedom, with
Cumulative Relative Frequencies ¹

Value of $\chi^2$       Relative frequency
                          of occurrence
                          (theoretical)

 0 or more                   1.0000
 1 or more                    .9626
 2 or more                    .8491
 3 or more                    .7000
 4 or more                    .5494
 5 or more                    .4159
 6 or more                    .3062
 7 or more                    .2206
 8 or more                    .1562
 9 or more                    .1091
10 or more                    .0752
11 or more                    .0514
12 or more                    .0348
13 or more                    .0234
14 or more                    .0156
15 or more                    .0104
16 or more                    .0068
30 or more                    .000015
 ∞                            .000000

¹ From the table by W. P. Elderton, given in Tables for Statisticians
and Biometricians, Karl Pearson, editor, 26. The n′ of Elderton's table
is equal to n + 1 for an example of the type here given.

In using tables of this sort we may interpret measures
of relative frequency as probabilities. Thus we may
read Table 142, which relates to the distribution of $\chi^2$
computed with five degrees of freedom, as follows: If
the true value of $\chi^2$ is zero (i.e., in an infinitely large sample
observed frequencies would agree precisely with the
theoretical frequencies we have set up), the probability of our
securing a $\chi^2$ of zero or more, from a sample of the type
here employed, is 1.00; the probability of our securing
a $\chi^2$ of 1.00 or more is 9,626/10,000; the probability of
our securing a $\chi^2$ of 3.00 or more is 7/10; the probability
of our securing a $\chi^2$ infinitely large is 0. The quantities
$\chi^2$ and P stand, thus, in a definite functional relationship,
for any given value of n (n denotes the number of degrees
of freedom). At the two limits the relationships are the
same for all values of n. When $\chi^2 = 0$, P = 1.00; when
$\chi^2 = \infty$, P = 0.00. But for intermediate values the
relationship varies with n.

Table 143 *

Table of $\chi^2$ for Selected Values of P and n

 n    P = .99      .95       .50       .10       .05       .02       .01

 1    .000157    .00393     .455     2.706     3.841     5.412     6.635
 2    .0201      .103      1.386     4.605     5.991     7.824     9.210
 3    .115       .352      2.366     6.251     7.815     9.837    11.341
 4    .297       .711      3.357     7.779     9.488    11.668    13.277
 5    .554      1.145      4.351     9.236    11.070    13.388    15.086
 6    .872      1.635      5.348    10.645    12.592    15.033    16.812
 7   1.239      2.167      6.346    12.017    14.067    16.622    18.475
 8   1.646      2.733      7.344    13.362    15.507    18.168    20.090
 9   2.088      3.325      8.343    14.684    16.919    19.679    21.666
10   2.558      3.940      9.342    15.987    18.307    21.161    23.209
11   3.053      4.575     10.341    17.275    19.675    22.618    24.725
12   3.571      5.226     11.340    18.549    21.026    24.054    26.217
13   4.107      5.892     12.340    19.812    22.362    25.472    27.688
14   4.660      6.571     13.339    21.064    23.685    26.873    29.141
15   5.229      7.261     14.339    22.307    24.996    28.259    30.578
16   5.812      7.962     15.338    23.542    26.296    29.633    32.000
17   6.408      8.672     16.338    24.769    27.587    30.995    33.409
18   7.015      9.390     17.338    25.989    28.869    32.346    34.805
19   7.633     10.117     18.338    27.204    30.144    33.687    36.191
20   8.260     10.851     19.337    28.412    31.410    35.020    37.566
21   8.897     11.591     20.337    29.615    32.671    36.343    38.932
22   9.542     12.338     21.337    30.813    33.924    37.659    40.289
23  10.196     13.091     22.337    32.007    35.172    38.968    41.638
24  10.856     13.848     23.337    33.196    36.415    40.270    42.980
25  11.524     14.611     24.337    34.382    37.652    41.566    44.314
26  12.198     15.379     25.336    35.563    38.885    42.856    45.642
27  12.879     16.151     26.336    36.741    40.113    44.140    46.963
28  13.565     16.928     27.336    37.916    41.337    45.419    48.278
29  14.256     17.708     28.336    39.087    42.557    46.693    49.588
30  14.953     18.493     29.336    40.256    43.773    47.962    50.892

* This table is reproduced here through the courtesy of R. A. Fisher and his
publishers, Oliver and Boyd, of Edinburgh. The entries are taken from
Table III of Statistical Methods for Research Workers.

In 1900 Karl Pearson defined the distribution function
of $\chi^2$.¹ The actual application of the $\chi^2$ test is facilitated by
prepared tables. Selected entries from these tabulations,
for different values of n, are given in Table 143
and in Appendix Table V.
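Where a printed table is not at hand, the probability P for a given $\chi^2$ and n can be computed directly. The sketch below uses the finite series that hold when n is a whole number (a standard result, not a procedure taken from the text); the function name `chi_square_tail` is an illustrative choice.

```python
from math import erfc, exp, pi, sqrt


def chi_square_tail(chi_square, n):
    """P that chi-square with n degrees of freedom equals or exceeds the value.

    Uses the finite series valid for whole-number degrees of freedom.
    """
    if chi_square <= 0:
        return 1.0
    half = chi_square / 2.0
    if n % 2 == 0:
        # Even n: exp(-x/2) * sum over r = 0 .. n/2 - 1 of (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, n // 2):
            term *= half / r
            total += term
        return exp(-half) * total
    # Odd n: normal-curve tail plus a finite series in odd powers of sqrt(x),
    # whose r-th term is x^(r - 1/2) / (1 * 3 * ... * (2r - 1)).
    term, total = sqrt(chi_square), 0.0
    for r in range(1, (n - 1) // 2 + 1):
        total += term
        term *= chi_square / (2 * r + 1)
    return erfc(sqrt(half)) + sqrt(2.0 / pi) * exp(-half) * total


# Spot checks against printed entries: chi-square, n, tabled P.
for value, n, tabled in [(3.0, 5, 0.7000), (3.841, 1, 0.05), (23.685, 14, 0.05)]:
    print(value, n, round(chi_square_tail(value, n), 4), tabled)
```

The computed figures should agree with the tabled ones to about the last printed decimal; the tables remain the more convenient instrument when only a critical value is wanted.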

For determining the significance of $\chi^2$ beyond the range
of this table, Fisher has given $\sqrt{2\chi^2} - \sqrt{2n - 1}$ as a value
which may be interpreted as a normal deviate. That is,
the figure derived when stated values of $\chi^2$ and n are inserted
in the above expression is to be taken as a deviation from
the mean of a normal distribution, expressed in units of
the standard deviation of that distribution. The
corresponding value of P is then derived from a table of areas
under the normal curve.
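A short sketch of this approximation follows. It tries the expression at the last line of Table 143 (n = 30, $\chi^2$ = 43.773, tabled P = .05), where the exact value is known, and reads P as the area of the normal curve above the deviate. The function names are illustrative, and the one-tailed reading of P is an assumption consistent with the way the $\chi^2$ tables are used.

```python
from math import erfc, sqrt


def fisher_normal_deviate(chi_square, n):
    """Fisher's approximation: sqrt(2 * chi_square) - sqrt(2n - 1)."""
    return sqrt(2.0 * chi_square) - sqrt(2.0 * n - 1.0)


def upper_tail_area(deviate):
    """Area under the standard normal curve above the given deviate."""
    return 0.5 * erfc(deviate / sqrt(2.0))


# Check at the edge of Table 143: n = 30, chi-square = 43.773 (tabled P = .05).
z = fisher_normal_deviate(43.773, 30)
p = upper_tail_area(z)
print(round(z, 3), round(p, 3))
```

The result falls a little short of .05, which gives some idea of the accuracy of the approximation at n = 30.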

The $\chi^2$ test is applicable to a considerable variety of
problems. Wherever, on rational grounds, a set of theoretical
frequencies may be derived, for comparison with observed
frequencies, this test is appropriate in judging of the
significance of the discrepancy between the two sets of
frequencies. In the following pages three applications of
this test are exemplified.

THE CHI-SQUARE TEST OF GOODNESS OF FIT

When an ideal frequency curve, whether normal or of
some other type, is fitted to an actual frequency distribution,
theory and observation are being compared. A test of the
concordance of the two (i.e., of goodness of fit) may be
made by inspection, but such a test is obviously inadequate.
Precision may be secured by employing the $\chi^2$ test. The
example in Table 144, relating to the distribution of
telephone subscribers discussed in Chapter XIII, illustrates the
procedure.

¹ “On the Criterion that a Given System of Deviations from the Probable
in the Case of a Correlated System of Variables is such that it can be
Reasonably Supposed to have Arisen from Random Sampling,” Philosophical
Magazine, 5th Series, Vol. L, 1900.

Table 144

Computation of $\chi^2$ for Testing Goodness of Fit
Normal Curve of Error Fitted to Distribution of Telephone Subscribers

    (1)             (2)          (3)           (4)            (5)
   Class          Observed    Theoretical
   limits         frequency    frequency     $f_o - f$   $(f_o - f)^2 / f$
                   $f_o$         $f$

150 and less         10         13.14        -  3.14          .75
150-200              19         16.76        +  2.24          .30
200-250              38         31.57        +  6.43         1.31
250-300              50         53.02        -  3.02          .17
300-350              95         79.43        + 15.57         3.05
350-400              85        106.10        - 21.10         4.20
400-450             115        126.41        - 11.41         1.03
450-500             132        134.31        -  2.31          .04
500-550             144        125.50        + 18.50         2.73
550-600             116        106.51        +  9.49          .85
600-650              79         81.85        -  2.85          .10
650-700              54         55.21        -  1.21          .03
700-750              31         33.19        -  2.19          .14
750-800              11         17.81        -  6.81         2.60
More than 800        16         14.19        +  1.81          .23

                    995        995.00       15 groups    $\chi^2$ = 17.53

There are 15 classes in this distribution. Since the total
theoretical frequencies must equal the total observed
frequencies, the entry in the fifteenth class is fixed when the
other 14 are established. The given value of $\chi^2$, 17.53,
is determined, therefore, with 14 degrees of freedom. From
Table 143 we see that when n = 14 a value of $\chi^2$ as great
as 23.685, or greater, would occur purely as a result of
chance in 5 out of 100 random samples, if the true value
of $\chi^2$ were zero. The value of 17.53 secured above is not
excessively high, therefore. The discrepancies between the
observed and theoretical frequencies in Table 144 could
easily have arisen as a result of chance. The fit obtained
with the normal curve is acceptable, which is to say that
our results are not inconsistent with the hypothesis that
the normal law of error defines the distribution of residence
telephone subscribers, classified on the basis of message use.
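The computation in Table 144 may be repeated mechanically. The sketch below takes the observed and theoretical frequencies from cols. (2) and (3), counts degrees of freedom as the text does (15 classes less one), and compares the result with the 5 per cent value for n = 14 from Table 143. Variable names are illustrative.

```python
# Cols. (2) and (3) of Table 144: observed and theoretical frequencies.
observed = [10, 19, 38, 50, 95, 85, 115, 132, 144, 116, 79, 54, 31, 11, 16]
theoretical = [13.14, 16.76, 31.57, 53.02, 79.43, 106.10, 126.41, 134.31,
               125.50, 106.51, 81.85, 55.21, 33.19, 17.81, 14.19]

# Col. (5): one component of chi-square for each class.
components = [(fo - f) ** 2 / f for fo, f in zip(observed, theoretical)]
chi_square = sum(components)

# As counted in the text: 15 classes, the last fixed by the common total.
degrees_of_freedom = len(observed) - 1

CRITICAL_05 = 23.685  # Table 143, n = 14, P = .05
fit_acceptable = chi_square < CRITICAL_05

print(round(chi_square, 2), degrees_of_freedom, fit_acceptable)
```

Carrying the components to full precision instead of two decimals changes the total only in the third decimal place.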
In applying the Chi-square test it is not necessary to
determine the exact probability corresponding to a stated
value of $\chi^2$. Our purpose, in general, is to ascertain whether
observed results are or are not consistent with the hypothesis
on which the fitting procedure is based. For this purpose
we wish only to know whether the value of P corresponding
to a given value of $\chi^2$ falls below (or, much more rarely,
above) certain critical values. As a conventional limit .05
is usually employed. If a value of $\chi^2$ is such that P is below
.05, the discrepancies between observed and theoretical
values are, on this standard, considered too great to be
attributed to chance. The hypothesis on the basis of which
the theoretical frequencies have been determined is suspect,
in such a case. If $\chi^2$ is large enough to give values of P
below .02 or .01, the inadequacy of the hypothesis is,
of course, more strongly indicated.

R. A. Fisher points out that suspicion should attach
to very low values of $\chi^2$, which give values of P of .99
or thereabout. These values indicate a very close agreement
between the hypothesis and the observed facts. Such close
agreement may be due to chance, but there is strong
probability that the hypothesis is at fault or, in mathematical
terms, that the wrong function is being used. Coincidence
of observed and theoretical values suggests the kind of
agreement one obtains by fitting to n points a curve in
the equation to which there are n constants. Any artificial
forcing of agreement between hypothesis and observation
of course invalidates the application of the Chi-square test.

In applying the Chi-square test it is convenient to use
the conventional standards we have noted, as guides to
the rejection or provisional acceptance of working
hypotheses. It is unwise to use these standards arbitrarily, however.
No single standard possesses significance in any absolute
sense. The investigator in a given field of research will
interpret the information such a test yields in the light of
other knowledge relating to that field of experience, and with
due regard to the rational foundation of the hypotheses
being tested.

One feature of Table 144 requires explanation. It will
be noted that in the construction of this table the three
classes at the lower end of the distribution have been lumped
into one, and that the same thing has been done with the
six classes at the upper end of the distribution (cf. Tables
109 and 144). This is done to avoid the undue magnification
of slight differences between the tails of the observed and
theoretical distributions. When $f$, the theoretical frequency,
is very small, a relatively slight absolute discrepancy between
$f_o$ and $f$ may serve to swell materially the value of $\chi^2$. The
lumping process is designed to prevent such a distortion.
Since the selection of classes for combination rests on the
personal judgment of the investigator, a subjective element is
necessarily introduced here. However, the results of the
test will not usually be much affected by minor variations
in the combination of tail-end classes.¹

The use of $\chi^2$ in testing the fit of theoretical frequency
curves is subject to another rather important limitation.
In the computation of $\chi^2$ no account is taken of the
distribution of discrepancies between $f_o$ and $f$. Yet the manner in
which these discrepancies are distributed may materially
influence our judgment as to the goodness of a given fit.
In such an example as that given in Table 144, the successive
values of $f_o - f$, counting from the lower limit of the x-scale,
might be alternately positive and negative. Something
approaching this alternation would be expected if chance
factors alone accounted for the differences between observed
and theoretical frequencies. But the differences might be
distributed otherwise. All the values of $f_o - f$ below the

¹ Considerations of the same sort suggest that a sample of reasonable size is
needed for the valid application of the Chi-square test in curve fitting. Deming
and Birge set 500 observations as the minimum required in a test of this type,
if confidence is to be placed in the result. Yule and Kendall suggest a smaller
number, but place emphasis on the need of an adequate number of theoretical
observations (preferably not less than 10) in every class.
non-durable categories in exactly the same way, 32.21 per
cent in the durable class, 67.79 per cent in the non-durable.
These expected frequencies, which conform to our
hypothesis that the two principles of classification are independent,
are given in Table 146.


Table 146

Expectation

Two-fold Classification of 208 Commodities

                               Expected frequencies
                       Number preceding   Number lagging behind
Commodity group         general index        general index        Total
                        on price rise        on price rise

Durable goods               18.04                48.96              67
Non-durable goods           37.96               103.04             141

Total                       56.00               152.00             208


Chi-square is computed from the relation
$\chi^2 = \sum \frac{(f_o - f)^2}{f}$ in the following manner:

$$\chi^2 = \frac{(6 - 18.04)^2}{18.04} + \frac{(50 - 37.96)^2}{37.96} + \frac{(61 - 48.96)^2}{48.96} + \frac{(91 - 103.04)^2}{103.04} = 16.222.$$


There are four components of Chi-square in this instance,
but, as may readily be seen by reference to the table of
expected frequencies, only one degree of freedom enters
into its computation. The expected frequencies must yield
the four group totals, 56, 152, 67, and 141. Accordingly,
as soon as we fill one of the four cells set up by the process
of cross-classification, the other three are definitely
determined. Given 18.04, the expected number of durable
goods preceding the general index in price revival, the
entries in the other cells are fixed. Subtraction of 18.04
from 56 and 67 will fill two of them, and the filling of these
determines the fourth.
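The same steps can be sketched in Python. The observed counts (6, 61, 50, 91) are those appearing in the computation of Chi-square above; expected frequencies are formed from the marginal totals, as the hypothesis of independence requires. Function names are illustrative only.

```python
def expected_frequencies(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_independence(table):
    """Chi-square summed over every cell of the cross-classification."""
    expected = expected_frequencies(table)
    return sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )


# Rows: durable, non-durable goods. Columns: preceding, lagging behind.
observed = [[6, 61], [50, 91]]

expected = expected_frequencies(observed)
chi_square = chi_square_independence(observed)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

print([[round(e, 2) for e in row] for row in expected])
print(round(chi_square, 2), degrees_of_freedom)
```

Because the expected frequencies are not rounded to two decimals here, the result differs from the 16.222 of the text in the third decimal place.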



For the interpretation of the given value of Chi-square 
we turn to Table 143, which is to be read with n, the number 
of degrees of freedom, equal to 1. If durability of economic 
goods has no relation to the timing of their price changes 
in revival, the two principles of classification employed 
above are independent and the true value of Chi-square 
is zero. Are the observed results consistent with this 
hypothesis? The entries in Table 143 indicate that if the 
true value were zero, a value as great as 3.841 would 
occur 5 times out of 100, as a result of chance fluctuations. 
A value as great as 6.635 would occur only 1 time out of 
100. The present value of Chi-square, 16.222, represents 
a still smaller probability. The results are not consistent 
with the hypothesis we have set up. The differences between 
the observed and expected frequencies are too great to be 
attributed to the play of chance. Durability, and factors 
of demand and supply related thereto, appear to play a 
definite role in the timing of price advances in business 
revivals. 

This test, it should be noted, does not define the relation- 
ship between durability of goods and the timing of price 
revival. It leads us to reject the hypothesis that durability 
has no bearing on the sequence of price advances in revival. 
If, on the basis of some other rational hypothesis, we could 
obtain a set of expected frequencies representing a definite 
relationship other than one of independence, this hypothesis 
could be tested in the same manner. From the present 
evidence, however, we may only conclude that the proportion 
of durable goods preceding the general price index on revival 
is smaller and the proportion of non-durable goods larger 
than would be expected if durability had no relation to 
the timing of price recovery after a business depression. 

THE CHI-SQUARE TEST OF HOMOGENEITY

For each of eight major industrial groups we have records 
showing, for the year 1933, the number of corporations 
reporting net incomes from their operations and the number
reporting no net incomes (i.e., suffering deficits). The 
returns relate to a total of 492,649 corporations. Is this 
total a homogeneous whole, or does the division of corpora- 
tions between those earning net incomes and those suffering 
deficits vary significantly from group to group? The records 
appear in Table 147. 

Table 147

Comparison of Observed and Theoretical Frequencies
(Tabulations based on corporate income tax returns for 1933, by major
industrial groups ¹)

        (1)                   (2)        (3)          (4)          (5)          (6)             (7)
                                        Actual     Theoretical
                                       number of    (expected)
                             Total      returns     number of
       Group                number      showing      returns     $f_o - f$  $(f_o - f)^2$  $(f_o - f)^2 / f$
                              of        no net      showing no
                            returns     income      net income
                                        ($f_o$)       ($f$)

Agriculture and
  related industries        10,490       7,818        7,150      +   668       446,224        62.4090
Mining and quarrying        17,147       8,866       11,688      - 2,822     7,963,684       681.3555
Manufacturing               93,833      62,295       63,958      - 1,663     2,765,569        43.2404
Construction                18,234      14,112       12,428      + 1,684     2,835,856       228.1828
Transportation and
  other public utilities    24,302      14,349       16,564      - 2,215     4,906,225       296.1980
Trade                      137,858      93,621       93,965      -   344       118,336         1.2594
Service                     47,843      35,419       32,610      + 2,809     7,890,481       241.9650
Finance                    142,942      99,314       97,431      + 1,883     3,545,689        36.3918

Total                      492,649     335,794      335,794                                1,591.0019
Per cent                   100.000      68.161





¹ From Statistics of Income for 1933, U. S. Treasury Department,
Washington, D. C.

The observed frequencies are, of course, the actual returns
given in col. (3) of Table 147. A set of theoretical or
expected frequencies, for comparison with these, may be
set up on the assumption that all corporations in the United
States constituted a homogeneous population as regards
the likelihood of suffering a deficit in 1933. On this
assumption we may say that the probability of failing to
earn a net profit was, for all the elements of this assumed
homogeneous population, 335,794/492,649, or .68161. If this is the
true probability for all elements of the population, we may
determine a theoretical frequency for each industrial group
by applying this ratio to the total number of corporations
in that group. On the assumption made we should find,
in all groups, the same proportionate division between
corporations earning net incomes and those suffering deficits,
except for modifications due to fluctuations of sampling.
The expected frequencies appear in column (4). If the
hypothesis of homogeneity is valid, these are the true
frequencies for the several groups. Differences between
these and the observed frequencies reflect the play of chance
alone.

The calculation of $\chi^2$, measuring the degree of discrepancy
between the observed and theoretical frequencies, is shown
in cols. (5), (6), and (7) of Table 147. The value of $\chi^2$,
computed with 7 degrees of freedom, is 1,591.0019. Since
the 1 per cent value of $\chi^2$, for n = 7, is only 18.475, the
conclusion is clear that the discrepancy is too great to
be attributed to chance. The results are not consistent
with the hypothesis of homogeneity. We are not justified
in assuming that the forces affecting the profitability of
corporate operations in 1933 were the same, among the
eight major industrial groups here represented.
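The test of homogeneity lends itself to the same kind of check. In the sketch below the group totals and observed counts are read from Table 147 after reconciling each row so that cols. (3), (4) and (5) agree and the column totals are met; they should be treated as a reconstruction of the printed figures. The ratio .68161 is applied to each group total and the expected frequency is rounded to a whole return, which is an assumption about how col. (4) was prepared.

```python
P_NO_NET_INCOME = 0.68161  # ratio applied to every group, as in the text

# (group, total returns, returns showing no net income), per Table 147
groups = [
    ("Agriculture and related industries", 10490, 7818),
    ("Mining and quarrying", 17147, 8866),
    ("Manufacturing", 93833, 62295),
    ("Construction", 18234, 14112),
    ("Transportation and other public utilities", 24302, 14349),
    ("Trade", 137858, 93621),
    ("Service", 47843, 35419),
    ("Finance", 142942, 99314),
]

chi_square = 0.0
for name, total, observed in groups:
    # Expected frequency for the group, rounded to a whole return (assumed).
    expected = round(total * P_NO_NET_INCOME)
    chi_square += (observed - expected) ** 2 / expected

degrees_of_freedom = len(groups) - 1
CRITICAL_01 = 18.475  # Table 143, n = 7, P = .01

print(round(chi_square, 2), degrees_of_freedom, chi_square > CRITICAL_01)
```

Whatever rounding convention is adopted, the result exceeds the 1 per cent value many times over, so the conclusion does not turn on such details.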

The various procedures discussed in this chapter give 
some indication of the variety and power of the methods 
available for use in interpreting and appraising the results 
of statistical research. Each one involves, in some form, 
the testing of hypotheses against evidence yielded by the 
study of samples. It should be emphasized that the formal 
procedures described in the preceding pages are employed at 
a rather late stage in actual research work. The experiment 
will have been planned, the field work done, hypotheses 
framed before the tests here discussed can be applied.
These various steps must, of course, be coordinated. The 
data must be gathered with reference to the hypotheses 
to be tested and to the analytical methods to be employed. 
Acquaintance with appropriate techniques is one pre- 
requisite of intelligent planning of research in which quanti- 
tative data are utilized. Familiarity with the characteristics 
and limitations of the available materials, and clear definition 
of the questions at issue, are equally important elements. 
THE METHOD OF LEAST SQUARES AS APPLIED
TO CERTAIN STATISTICAL PROBLEMS

The method of least squares in the case of a single
unknown quantity is merely a procedure for obtaining the
most probable value of that quantity from a number of
separate observations. The most probable value is that
for which the sum of the squares of the deviations (or
residuals) is a minimum. This is the arithmetic mean of
the observations.
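A small numerical illustration may make this concrete. The readings below are hypothetical, chosen only for the example; the sketch compares the sum of squared residuals about the arithmetic mean with the sums about two neighboring values.

```python
def sum_of_squared_residuals(observations, value):
    """Sum of squared deviations of the observations from a trial value."""
    return sum((x - value) ** 2 for x in observations)


# Hypothetical repeated measurements of a single quantity.
observations = [10.2, 9.8, 10.5, 10.1, 9.9]
mean = sum(observations) / len(observations)

# The sum of squares is smallest at the mean and larger on either side.
for candidate in (mean - 0.1, mean, mean + 0.1):
    print(round(candidate, 2),
          round(sum_of_squared_residuals(observations, candidate), 4))
```

Moving the trial value a distance d from the mean raises the sum of squares by the number of observations times d squared, so any other value gives a larger sum.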

Where the measurements or observations do not relate 
directly to a single unknown quantity, but to functions of a 
number of unknown quantities, the problem is somewhat 
different. In the first case mentioned each observation is 
in the form of a single magnitude. In the present case 
each observation is in the form of an observation equation 
in which the observed values of the variables, as found in 
combination, are entered. The unknown quantities are 
the constants which define the functional relationship 
between the variables in question. Our problem is that 
of finding the most probable values of these constants, the 
true values being unknown. 

As in the simpler case the most probable values are
those for which the sum of the squares of the residuals
is a minimum. In this case, however, the residuals are
deviations, not from a single magnitude, as in the case
of the arithmetic mean, but from the curve which describes
the most probable functional relationship. The residuals
are the differences between the computed and the actual
values of the dependent variable.
DERIVATION OF THE NORMAL EQUATIONS

Representing by $Y$ an observed value of the dependent
variable, by $Y_c$ the corresponding computed value, by $v$ the
residual, or difference between $Y$ and $Y_c$, and by $W_1$, $W_2$,
$W_3$, and $W_4$ different independent variables (or different
functions of a single independent variable), we may write

$$Y_c = f(W_1, W_2, W_3, W_4)$$

$$v = Y_c - Y = f(W_1, W_2, W_3, W_4) - Y$$

$$\sum(v^2) = \sum[f(W_1, W_2, W_3, W_4) - Y]^2.$$

If the function in a particular case is of the type

$$Y_c = aW_1 + bW_2 + cW_3 + dW_4$$

we have

$$\sum(v^2) = \sum[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2.$$

Our problem is that of determining the most probable
values of the constants that define the function. These
constants are represented, in the present case, by a, b, c,
and d. (The W's, it should be noted, refer to quantities
which are known, once the observation equations are given.
In the usual case the W's are different functions of a single
variable, but this is not essential.) On the assumption
that the errors of observation are distributed in accordance
with the normal law of error, it may be demonstrated
that the most probable values of a, b, c, and d, in the above
equation, are those which render $\sum(v^2)$ a minimum; i.e.,

$$\sum[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2 = \text{a minimum.} \qquad \text{(a)}$$

The normal equations necessary for the solution may be
obtained by equating to zero the partial derivatives of
the above expression with respect to the unknowns, a, b,
c, and d. That is, we first differentiate the above function
with respect to a, holding b, c, and d constant, then with
respect to b, holding a, c, and d constant, then with respect
to c, holding a, b, and d constant, then with respect to d,
holding a, b, and c constant. Carrying through this operation
with respect to a, we have

$$\frac{\partial}{\partial a} \sum[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2 = 0$$

or

$$\sum W_1[(aW_1 + bW_2 + cW_3 + dW_4) - Y] = 0. \qquad \text{(I)}$$

Differentiating equation (a) now with respect to b, we have

$$\frac{\partial}{\partial b} \sum[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2 = 0$$

or

$$\sum W_2[(aW_1 + bW_2 + cW_3 + dW_4) - Y] = 0. \qquad \text{(II)}$$

Differentiating equation (a) with respect to c,

$$\frac{\partial}{\partial c} \sum[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2 = 0$$

or

$$\sum W_3[(aW_1 + bW_2 + cW_3 + dW_4) - Y] = 0. \qquad \text{(III)}$$

Differentiating equation (a) with respect to d,

$$\frac{\partial}{\partial d} \sum[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2 = 0$$

or

$$\sum W_4[(aW_1 + bW_2 + cW_3 + dW_4) - Y] = 0. \qquad \text{(IV)}$$

The most probable values of the quantities a, b, c, and
d are secured by solving simultaneously the four normal
equations thus obtained (numbered above I, II, III, IV).

FORMATION OF THE NORMAL EQUATIONS

When the observation equations are all of the first degree
(i.e., of the first degree with respect to the unknown
quantities, a, b, c, etc.) the normal equations may be secured
by the following process:

1. Write the equation which describes the assumed relationship.
The observation equations are derived by substituting in this
equation the observed values of the variables, as found in
combination.

2. Multiply each observation equation by the coefficient of the 
first unknown in that equation; the sum of the resulting equations 
constitutes the first normal equation. 

3. Multiply each observation equation by the coefficient of the 
second unknown in that equation; the sum of the resulting equa- 
tions constitutes the second normal equation. 

Continue this process until normal equations equal in 
number to the unknown quantities are obtained. 

The actual process of forming the normal equations in 
curve fitting may be simplified, and the writing out of the 
separate observation equations avoided, as was demonstrated 
in earlier sections. The following may be laid down as 
general rules for the formation of the desired normal 
equations: 

1. Write the equation of the curve to be fitted. For the purpose 
of this explanation we may employ the general form 

Y = aW_1 + bW_2 + cW_3 + dW_4 + . . . (1)

where Y represents the dependent variable, a, b, c, . . . represent
the constants in the equation (the unknown quantities in the
present instance) and W_1, W_2, W_3, W_4, . . . represent the
coefficients of these unknowns. It is assumed that these coefficients
represent variables, and that term is used with reference to them.
Call this equation (1).

2. Multiply each term in equation (1) by the coefficient of the
first unknown in (1) (i.e., by W_1) and place the summation sign,
Σ, before each variable. This is the first normal equation (I).

3. Multiply each term in equation (1) by the coefficient of the
second unknown (i.e., by W_2) and place the summation sign before
each variable. This is the second normal equation (II).

4. Multiply each term in equation (1) by the coefficient of the
third unknown (i.e., by W_3) and place the summation sign before
each variable. This is the third normal equation (III).

5. Multiply each term in equation (1) by the coefficient of the
fourth unknown (i.e., by W_4) and place the summation sign before
each variable. This is the fourth normal equation (IV).

The process may be continued until normal equations
equal in number to the unknown quantities are obtained.¹
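
The rules above can be sketched in a few lines of code. The following is a minimal illustration only, not part of the original text: the function names `normal_equations` and `solve` and the five data points are hypothetical, chosen so that the points lie exactly on Y = 1 + 2X and the recovered constants are easy to verify.

```python
def normal_equations(W, Y):
    """Form the normal equations for Y = a*W1 + b*W2 + ... by the rule
    in the text: multiply equation (1) by each W in turn and sum."""
    k = len(W)
    A = [[sum(u * v for u, v in zip(W[i], W[j])) for j in range(k)]
         for i in range(k)]
    rhs = [sum(u * y for u, y in zip(W[i], Y)) for i in range(k)]
    return A, rhs


def solve(A, rhs):
    """Solve A x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, row)) + [float(r)] for row, r in zip(A, rhs)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(n):
            if r != i:
                f = M[r][i] / M[i][i]
                M[r] = [x - f * y for x, y in zip(M[r], M[i])]
    return [M[i][n] / M[i][i] for i in range(n)]


# Hypothetical data lying exactly on Y = 1 + 2X, so W1 = 1 and W2 = X.
X = [1, 2, 3, 4, 5]
Y = [3, 5, 7, 9, 11]
W = [[1] * len(X), X]

A, rhs = normal_equations(W, Y)   # A holds N, S(X), S(X^2); rhs holds S(Y), S(XY)
a, b = solve(A, rhs)
print(A, rhs, a, b)
```

Because every coefficient column is treated alike, the same function serves whether the W's are powers of a single variable or several distinct variables.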

A STANDARD SET OF NORMAL EQUATIONS 

As a set of generalized normal equations secured by the 
above process and applying to any equation which can be 
put in the form 

Y = aW_1 + bW_2 + cW_3 + dW_4 + . . . ,

we have

I     Σ(W_1Y) = aΣ(W_1^2) + bΣ(W_1W_2) + cΣ(W_1W_3) + dΣ(W_1W_4) + . . .

II    Σ(W_2Y) = aΣ(W_1W_2) + bΣ(W_2^2) + cΣ(W_2W_3) + dΣ(W_2W_4) + . . .

III   Σ(W_3Y) = aΣ(W_1W_3) + bΣ(W_2W_3) + cΣ(W_3^2) + dΣ(W_3W_4) + . . .

IV    Σ(W_4Y) = aΣ(W_1W_4) + bΣ(W_2W_4) + cΣ(W_3W_4) + dΣ(W_4^2) + . . .

By substituting for W_1, W_2, W_3, W_4, etc., the particular
functions employed in a given case, these equations may
be readily adapted to any type of curve in the fitting of
which the method of least squares is applicable. Thus in
fitting a curve represented by the equation

Y = a + bX + cX^2 + dX^3

substitutions in the standard normal equations given above
are based upon the following relations:

W_1 = 1
W_2 = X
W_3 = X^2
W_4 = X^3

The changes to be made in the normal equations are
obvious. Σ(W_1Y) becomes Σ(Y); Σ(W_1^2) is equivalent to
Σ(1^2), which is equal to N, the total number of observations.

¹ These rules represent an adaptation of a similar series formulated by
Raymond Pearl in Medical Biometry and Statistics, 341.
The first normal equation becomes

Σ(Y) = Na + bΣ(X) + cΣ(X^2) + dΣ(X^3).

The other normal equations are modified correspondingly. 

In the example just given, the coefficients are all different
functions of a single independent variable, X. It is not, of
course, essential to the method of least squares that this
be so. The coefficients, W_1, W_2, W_3, etc., may represent a
number of independent variables, as in the case of multiple
correlation.

The limitations to the method of least squares must be
borne in mind in making use of it. This method, in its
direct application, is limited to cases in which the equation
to the curve to be fitted is linear in the constants, i.e., the
observation equations must all be linear as regards the
unknown values, a, b, c, etc. (This does not mean, of course,
that the equation to the fitted curve must be linear.) As
an example of this limitation, we may cite a curve having
as equation y = ab^x, which cannot be fitted directly by
the method of least squares. If the observation equations
are non-linear they may be reduced to the linear form in
many instances by the use of logarithms, and the method
of least squares then employed.
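
As an illustration of the reduction by logarithms, the sketch below fits an exponential curve of the assumed form y = a·b^x by applying least squares to log y = log a + x log b. The function name `fit_exponential` and the data are hypothetical. Note that this minimizes the squared residuals of the logarithms, not of the original y-values, so the result is not in general the least squares fit to y itself.

```python
import math


def fit_exponential(X, Y):
    """Fit y = a * b**x by a straight-line least squares fit to log10(y)."""
    n = len(X)
    L = [math.log10(y) for y in Y]            # transformed observations
    sx, sl = sum(X), sum(L)
    sxx = sum(x * x for x in X)
    sxl = sum(x * l for x, l in zip(X, L))
    slope = (n * sxl - sx * sl) / (n * sxx - sx * sx)   # estimate of log b
    intercept = (sl - slope * sx) / n                   # estimate of log a
    return 10 ** intercept, 10 ** slope


# Hypothetical data lying exactly on y = 3 * 2**x.
X = [0, 1, 2, 3, 4]
Y = [3 * 2 ** x for x in X]
a, b = fit_exponential(X, Y)
print(a, b)
```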

DERIVATION OF THE FORMULA FOR THE STANDARD ERROR
OF ESTIMATE

It has been pointed out in the body of the text that the 
standard error of estimate may be derived as a by-product 
of the method of least squares. A more complete demon- 
stration of this process may be given at this point. 

When the partial derivative of the expression

Σ[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2 = a minimum

is equated to zero, with respect to the first unknown, a,
we have

ΣW_1[(aW_1 + bW_2 + cW_3 + dW_4) - Y] = 0.

Since

aW_1 + bW_2 + cW_3 + dW_4 - Y = v,

we have as a necessary condition of fitting

Σ(vW_1) = 0.

When the partial derivative of the same expression with
respect to b is equated to zero, we have

ΣW_2[(aW_1 + bW_2 + cW_3 + dW_4) - Y] = 0

or, making the same substitution as in the preceding case,

Σ(vW_2) = 0.

Repeating the operation with respect to c and d, we may
show that

Σ(vW_3) = 0

and

Σ(vW_4) = 0.

In summary: When the method of least squares is
employed in determining the most probable values of certain
unknown quantities, having as known coefficients the
quantities W_1, W_2, W_3, W_4, the following relations hold
as a necessary condition of the least squares method:

Σ(vW_1) = 0
Σ(vW_2) = 0
Σ(vW_3) = 0
Σ(vW_4) = 0.

A knowledge of these relationships gives us a method of
securing readily the value Σ(v^2) and the standard error of
estimate. Assume that, by the method of least squares, we
have determined the constants in an equation of the type

Y_c = aW_1 + bW_2 + cW_3 + dW_4.

For each residual we have the relation

v = aW_1 + bW_2 + cW_3 + dW_4 - Y. (1)

Multiplying throughout by v, and summing, we have

Σ(v^2) = aΣ(vW_1) + bΣ(vW_2) + cΣ(vW_3) + dΣ(vW_4) - Σ(Yv). (2)

But

Σ(vW_1) = 0
Σ(vW_2) = 0
Σ(vW_3) = 0
Σ(vW_4) = 0

therefore,

Σ(v^2) = -Σ(Yv). (3)

Multiplying each equation (1) throughout by Y, and
adding, we have

Σ(Yv) = aΣ(W_1Y) + bΣ(W_2Y) + cΣ(W_3Y) + dΣ(W_4Y) - Σ(Y^2). (4)

Substituting in (3) the equivalent of Σ(Yv), we have

Σ(v^2) = Σ(Y^2) - aΣ(W_1Y) - bΣ(W_2Y) - cΣ(W_3Y) - dΣ(W_4Y). (5)

This gives us a method of obtaining the value Σ(v^2)
without computing the separate residuals, a method that
is applicable whenever the equation of the curve to be
fitted is of the form, or may be reduced by the use of
logarithms, reciprocals, or other manipulation to the form

Y = aW_1 + bW_2 + cW_3 + dW_4.

In applying this to a particular case it is necessary only to
replace W_1, W_2, W_3, W_4, etc., by the functions that actually
appear as coefficients of the unknown quantities in the
original equation. Thus in fitting a curve the equation to
which is

Y = a + bX + cX^2 + dX^3

we find, as noted above, that

W_1 = 1
W_2 = X
W_3 = X^2
W_4 = X^3.

Making these substitutions in equation (5) above, we have

Σ(v^2) = Σ(Y^2) - aΣ(Y) - bΣ(XY) - cΣ(X^2Y) - dΣ(X^3Y). (6)

The standard error, S_y, is derived from the equation

S_y^2 = Σ(d^2)/N *

where d is used to represent a deviation from a fitted curve.
The deviation, d, then, is but another term for the residual
v. Accordingly, as a general expression for the standard
error of Y, with W_1, W_2, W_3, and W_4 as independent
variables, we have

S^2_y.w1w2w3w4 = [Σ(Y^2) - aΣ(W_1Y) - bΣ(W_2Y) - cΣ(W_3Y) - dΣ(W_4Y)] / N. (7)

As in the previous case, this may be applied to a particular
problem by replacing W_1, W_2, W_3, W_4, etc., by the actual
coefficients of the unknown quantities.

DERIVATION OF THE FORMULA FOR THE INDEX OF
CORRELATION

We have adopted as an index of the degree of correlation
between two variables the measure ρ (rho), derived from
the equation

ρ^2_yx = 1 - S_y^2/σ_y^2 (8)

assuming a single dependent variable, Y, and a single
independent variable, X. With a single dependent variable, Y,
and a number of independent variables, W_1, W_2, W_3, W_4,
the expression might be written

ρ^2_y.w1w2w3w4 = 1 - S^2_y.w1w2w3w4/σ_y^2. (9)


* Since our object is to measure the actual scatter about the fitted curve,
the formula Σ(d^2)/N is used, rather than the formula Σ(d^2)/(N - N_c)
(where N represents the number of observations and N_c the number of
constants in the equation to the fitted curve). The second formula would
be used, in accordance with the theory of least squares, if we were
seeking to determine the mean
Corresponding changes would be made in the subscripts for
other changes in the symbols employed. The expression
above is equivalent to

ρ^2_y.w1w2w3w4 = 1 - Σ(d^2)/Σ(y^2)

where y represents a deviation from an origin at the mean
of the Y's. But

Σ(y^2)/N = Σ(Y^2)/N - c_y^2

where Y represents the original values of the Y-variable
and c_y represents the difference between the original origin
and the mean of the Y's. (The symbol c_y should
not be confused with c, one of the constants in the equation
to the fitted curve.)

Accordingly, we have

ρ^2_y.w1w2w3w4 = 1 - Σ(d^2)/[Σ(Y^2) - Nc_y^2]. (10)

But we have secured an expression for Σ(v^2) [the equivalent
of Σ(d^2)] which holds in the case of a curve fitted by the
method of least squares. Substituting the equivalent of
Σ(d^2) in the above equation, and simplifying, we have,
as a general formula for the index of correlation

ρ^2_y.w1w2w3w4... = [aΣ(W_1Y) + bΣ(W_2Y) + cΣ(W_3Y) + dΣ(W_4Y) + . . . - Nc_y^2] / [Σ(Y^2) - Nc_y^2]. (11)

This may be applied to a specific case by replacing W_1,
W_2, W_3, W_4, etc., in the above formula by the functions
which appear as coefficients of the unknown quantities in
the original equation. When all these are functions of a
single independent variable, as in the usual case, the index
of correlation would be represented by the symbol ρ_yx.
CERTAIN SPECIAL CASES

In the case of multiple correlation, where the symbols
X_1, X_2, X_3, X_4, etc., are used to represent all the variables,
whether considered dependent or independent, the symbol
R is employed for the measure of correlation and numerical
subscripts utilized as described in the body of the text.

In the case of a straight line relationship between two
variables, ρ is replaced by the symbol r, which represents
the ordinary coefficient of correlation. As the general
equation for r we have

r^2 = [aΣ(Y) + bΣ(XY) - Nc_y^2] / [Σ(Y^2) - Nc_y^2].

There are two special cases in which this formula may be
simplified. If the origin be at the mean of the X's, we
have

Σ(x) = 0, and a = c_y

and the formula for r reduces to

r^2 = bΣ(xY) / [Σ(Y^2) - Nc_y^2].

If the origin be at the mean of the Y's (it is not essential
that it be also at the mean of the X's)

Σ(y) = 0, and c_y = 0

and the formula for the coefficient of correlation becomes

r^2 = bΣ(Xy) / Σ(y^2).

In this latter case the general formula for ρ may also be
simplified by the elimination of the terms aΣ(y) and Nc_y^2.
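
The general straight-line formula for r^2 may be checked numerically against the familiar product-moment computation. The sketch below is an illustration only; it uses the nine points to which a line is fitted in the section on checks, and the variable names are hypothetical. The constant c_y is taken as the mean of the Y's, the original origin being zero.

```python
# Nine (X, Y) points, taken from the line-fitting example in this appendix.
X = [1, 2, 3, 4, 5, 6, 7, 8, 9]
Y = [3, 4, 6, 5, 10, 9, 10, 12, 11]
N = len(X)

SX, SY = sum(X), sum(Y)
SXX = sum(x * x for x in X)
SYY = sum(y * y for y in Y)
SXY = sum(x * y for x, y in zip(X, Y))

# Least squares constants of Y = a + bX from the two normal equations.
b = (N * SXY - SX * SY) / (N * SXX - SX ** 2)
a = (SY - b * SX) / N
c_y = SY / N                      # mean of the Y's (original origin at zero)

# r^2 from the general formula in the text.
r2_formula = (a * SY + b * SXY - N * c_y ** 2) / (SYY - N * c_y ** 2)

# r^2 from sums of squares and products about the means.
sxy = SXY - SX * SY / N
sxx = SXX - SX ** 2 / N
syy = SYY - SY ** 2 / N
r2_direct = sxy ** 2 / (sxx * syy)

print(round(r2_formula, 4), round(r2_direct, 4))
```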

CHECKS ON THE FORMATION OF THE NORMAL EQUATIONS

There are so many possibilities of arithmetical error in
the formation and solution of a set of normal equations
that checks should be employed wherever possible. A
convenient check on the calculations leading to the normal
equations is afforded by the introduction in each observation
equation of an additional term, s, equal to the sum of all
the known quantities in that equation. Thus, in the following
system of observation equations, formed in fitting a line
to the points 1, 3; 2, 4; 3, 6; 4, 5; 5, 10; 6, 9; 7, 10; 8, 12;
9, 11, the values of s are as indicated:

                     s
 3 = a + 1b          5
 4 = a + 2b          7
 6 = a + 3b         10
 5 = a + 4b         10
10 = a + 5b         16
 9 = a + 6b         16
10 = a + 7b         18
12 = a + 8b         21
11 = a + 9b         21.

(The coefficient of a in each case is 1, and this is added to
the other known quantities.)

In fitting a curve described by the type equation

Y = aW_1 + bW_2 + cW_3 + dW_4

the following relations prevail between s and the other
quantities computed. For each observation equation,

Y + W_1 + W_2 + W_3 + W_4 = s.

For the normal equations,

Σ(W_1Y) + Σ(W_1^2) + Σ(W_1W_2) + Σ(W_1W_3) + Σ(W_1W_4) = Σ(W_1s)
Σ(W_2Y) + Σ(W_1W_2) + Σ(W_2^2) + Σ(W_2W_3) + Σ(W_2W_4) = Σ(W_2s)
Σ(W_3Y) + Σ(W_1W_3) + Σ(W_2W_3) + Σ(W_3^2) + Σ(W_3W_4) = Σ(W_3s)
Σ(W_4Y) + Σ(W_1W_4) + Σ(W_2W_4) + Σ(W_3W_4) + Σ(W_4^2) = Σ(W_4s)

This form is capable of application to any specific problem. 
In each case the s-equations are formed in precisely the 
same way as the corresponding normal equations. 

In applying these checks several additional columns are 
needed in the working tables, but the extra trouble is 
more than compensated by the opportunity to check the 
work at each stage. The application is illustrated in the
following working table, showing the calculations involved
in fitting a second degree curve of the form

Y = a + bX + cX^2

to the nine points 1, 2; 2, 6; 3, 7; 4, 8; 5, 10; 6, 11; 7, 11;
8, 10; 9, 9.

Table A

Illustrating the Use of Checks on the Formation of Normal Equations

   Y     X    X^2     XY    X^2Y      s      Xs    X^2s

   2     1      1      2       2      5       5       5
   6     2      4     12      24     13      26      52
   7     3      9     21      63     20      60     180
   8     4     16     32     128     29     116     464
  10     5     25     50     250     41     205   1,025
  11     6     36     66     396     54     324   1,944
  11     7     49     77     539     68     476   3,332
  10     8     64     80     640     83     664   5,312
   9     9     81     81     729    100     900   8,100

  74    45    285    421   2,771    413   2,776  20,414

(Columns for X^3 and X^4 are omitted, as the values Σ(X^3) and Σ(X^4) may be
derived from prepared tables.)

Each of the values in the column headed s is secured
from the corresponding observation equation. Thus, from
the first observation equation

2 = 1a + 1b + 1c

we have 5 as the value of s (2, plus the coefficients of the
three constants). These values of s are secured readily
from the table by adding the figures in the columns headed
Y, X, and X^2, and adding 1 in each case for the
coefficient of the constant term a.

Adding the various columns, the arithmetic work is
verified by the following checks:

Σ(Y) + N + Σ(X) + Σ(X^2) = Σ(s)
74 + 9 + 45 + 285 = 413

Σ(XY) + Σ(X) + Σ(X^2) + Σ(X^3) = Σ(Xs)
421 + 45 + 285 + 2,025 = 2,776

Σ(X^2Y) + Σ(X^2) + Σ(X^3) + Σ(X^4) = Σ(X^2s)
2,771 + 285 + 2,025 + 15,333 = 20,414.

Further uses of a check of this kind are explained below, 
in discussing the solution of the normal equations. 
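
The column checks just described are easily mechanized. The following sketch is an illustration only, with hypothetical variable names; it recomputes the s column for the nine points of Table A and verifies the three check relations.

```python
# The nine points of Table A, fitted by Y = a + bX + cX^2.
X = [1, 2, 3, 4, 5, 6, 7, 8, 9]
Y = [2, 6, 7, 8, 10, 11, 11, 10, 9]
N = len(X)

# s = Y plus the coefficients of a, b and c in each observation equation.
s = [y + 1 + x + x ** 2 for x, y in zip(X, Y)]


def S(power):
    """Sum of X raised to the given power."""
    return sum(x ** power for x in X)


# Left-hand sides of the three check relations.
check_I = sum(Y) + N + S(1) + S(2)
check_II = sum(x * y for x, y in zip(X, Y)) + S(1) + S(2) + S(3)
check_III = sum(x * x * y for x, y in zip(X, Y)) + S(2) + S(3) + S(4)

# Right-hand sides: S(s), S(Xs), S(X^2 s).
sum_s = sum(s)
sum_Xs = sum(x * v for x, v in zip(X, s))
sum_X2s = sum(x * x * v for x, v in zip(X, s))

print(check_I == sum_s, check_II == sum_Xs, check_III == sum_X2s)
```

A disagreement in any one relation localizes the error to the columns entering that relation.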

OTHER TESTS

The possibility of checking the calculations in other ways
has been suggested in the preceding sections. Thus, where
the coefficients of the constants in the equation to the
fitted curve are represented by W_1, W_2, W_3, W_4, we know
that

Σ(vW_1) = 0
Σ(vW_2) = 0
Σ(vW_3) = 0
Σ(vW_4) = 0.

If a curve of the type

Y = a + bX + cX^2 + dX^3

has been fitted, this means that

Σ(v) = 0
Σ(vX) = 0
Σ(vX^2) = 0
Σ(vX^3) = 0.

The accuracy of the work may be tested by checking these 
relations. 

Finally, we may test the accuracy of the work by computing
the standard error of estimate in two different ways.
We may compute the separate residuals by taking the
difference between computed and actual values of the
dependent variable, and from these values determine S_y.
This may be compared with the results secured by applying
the general formula for the standard error, as derived above.
In the fitting of the second degree curve, the data of which
were used to illustrate the method of checking the normal
equations, the equation derived was

Y = -.92860 + 3.52316X - .267316X^2.

From the residuals separately computed, we have

S_y = .4941.

From the formula

S_y^2 = [Σ(Y^2) - aΣ(Y) - bΣ(XY) - cΣ(X^2Y)] / N

we have

S_y = .4947.

This constitutes a final check upon the accuracy of the
calculations.

SIMPLIFICATION OF NORMAL EQUATIONS IN A MULTIPLE
CORRELATION PROBLEM*

In the discussion of multiple correlation procedure in
Chapter XVI the normal equations, as first derived in the
form

I     Σ(X_1) = Na + b_12.34 Σ(X_2) + b_13.24 Σ(X_3) + b_14.23 Σ(X_4)

II    Σ(X_1X_2) = aΣ(X_2) + b_12.34 Σ(X_2^2) + b_13.24 Σ(X_2X_3)
          + b_14.23 Σ(X_2X_4)

III   Σ(X_1X_3) = aΣ(X_3) + b_12.34 Σ(X_2X_3) + b_13.24 Σ(X_3^2)
          + b_14.23 Σ(X_3X_4)

IV    Σ(X_1X_4) = aΣ(X_4) + b_12.34 Σ(X_2X_4) + b_13.24 Σ(X_3X_4)
          + b_14.23 Σ(X_4^2)

were reduced in number and modified to facilitate their
solution. Details of the method are here given.

Letting A_1, A_2, A_3, and A_4 represent the arithmetic
means of the several variables, and x_1, x_2, x_3, and x_4 represent
deviations from the means, we may replace the variables

MdMeSjStiS; P 1?“®^ “r Handling 

2, m American Statistical Assodl 
X_1, X_2, X_3, and X_4 by their equivalents x_1 + A_1, x_2 + A_2,
x_3 + A_3, and x_4 + A_4. The normal equations now become:

I     Σ(x_1 + A_1) = Na + Σ(x_2 + A_2)·b_12.34 + Σ(x_3 + A_3)·b_13.24
          + Σ(x_4 + A_4)·b_14.23

II    Σ[(x_1 + A_1)(x_2 + A_2)] = Σ(x_2 + A_2)·a
          + Σ(x_2 + A_2)^2·b_12.34
          + Σ[(x_2 + A_2)(x_3 + A_3)]·b_13.24
          + Σ[(x_2 + A_2)(x_4 + A_4)]·b_14.23

III   Σ[(x_1 + A_1)(x_3 + A_3)] = Σ(x_3 + A_3)·a
          + Σ[(x_3 + A_3)(x_2 + A_2)]·b_12.34
          + Σ(x_3 + A_3)^2·b_13.24
          + Σ[(x_3 + A_3)(x_4 + A_4)]·b_14.23

IV    Σ[(x_1 + A_1)(x_4 + A_4)] = Σ(x_4 + A_4)·a
          + Σ[(x_4 + A_4)(x_2 + A_2)]·b_12.34
          + Σ[(x_4 + A_4)(x_3 + A_3)]·b_13.24
          + Σ(x_4 + A_4)^2·b_14.23

Since Σ(x_1 + A_1) = Σx_1 + NA_1, and since Σx_1 = 0, Σ(x_1 + A_1)
and all similar expressions may be replaced by NA_1, NA_2, etc.
If we expand Σ(x_2 + A_2)^2 to Σ(x_2^2 + 2A_2x_2 + A_2^2) the
middle term drops out, because Σx_2 = 0, and the expression
may be written Σ(x_2^2) + NA_2^2. The sums of the other
squares may be put in similar form. Likewise,
Σ[(x_1 + A_1)(x_2 + A_2)] = Σ(x_1x_2 + A_1x_2 + A_2x_1 + A_1A_2)
= Σx_1x_2 + NA_1A_2, since Σx_1 = 0 and Σx_2 = 0. The other
expressions of this type may be similarly modified. The
normal equations now take the form:

I     NA_1 = Na + NA_2 b_12.34 + NA_3 b_13.24 + NA_4 b_14.23

II    Σ(x_1x_2) + NA_1A_2 = NA_2 a + [Σ(x_2^2) + NA_2^2]b_12.34
          + [Σ(x_2x_3) + NA_2A_3]b_13.24 + [Σ(x_2x_4) + NA_2A_4]b_14.23

III   Σ(x_1x_3) + NA_1A_3 = NA_3 a + [Σ(x_2x_3) + NA_2A_3]b_12.34
          + [Σ(x_3^2) + NA_3^2]b_13.24 + [Σ(x_3x_4) + NA_3A_4]b_14.23

IV    Σ(x_1x_4) + NA_1A_4 = NA_4 a + [Σ(x_2x_4) + NA_2A_4]b_12.34
          + [Σ(x_3x_4) + NA_3A_4]b_13.24 + [Σ(x_4^2) + NA_4^2]b_14.23.


If we now divide through by N, and substitute p_12 for
Σx_1x_2/N, σ_2^2 for Σ(x_2^2)/N, and similar symbols for other mean
products and mean squares, the normal equations become

I     A_1 = a + A_2 b_12.34 + A_3 b_13.24 + A_4 b_14.23

II    p_12 + A_1A_2 = A_2 a + (σ_2^2 + A_2^2)b_12.34 + (p_23 + A_2A_3)b_13.24
          + (p_24 + A_2A_4)b_14.23

III   p_13 + A_1A_3 = A_3 a + (p_23 + A_2A_3)b_12.34 + (σ_3^2 + A_3^2)b_13.24
          + (p_34 + A_3A_4)b_14.23

IV    p_14 + A_1A_4 = A_4 a + (p_24 + A_2A_4)b_12.34 + (p_34 + A_3A_4)b_13.24
          + (σ_4^2 + A_4^2)b_14.23.

These four simultaneous equations may now be reduced
to three. We multiply equation I, throughout, by A_2,
and subtract the result from equation II; we then multiply
equation I by A_3, and subtract the result from equation III;
we then multiply equation I by A_4, and subtract the result
from equation IV. All the terms containing A's are thus
eliminated and we obtain the three normal equations

p_12 = σ_2^2 b_12.34 + p_23 b_13.24 + p_24 b_14.23

p_13 = p_23 b_12.34 + σ_3^2 b_13.24 + p_34 b_14.23

p_14 = p_24 b_12.34 + p_34 b_13.24 + σ_4^2 b_14.23.

Inserting the observed values of the p's and the σ's, these
are solved for the coefficients b. The value a may then be
obtained by inserting the values of the A's and the b's
in the equation

A_1 = a + A_2 b_12.34 + A_3 b_13.24 + A_4 b_14.23.
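
The reduction from four equations to three may be illustrated as follows. This is a sketch under stated assumptions, not a reproduction of any computation in the text: the six observations are hypothetical and were constructed so that X_1 = 2 + X_2 + 0.5X_3 - X_4 exactly, and the helper names `mean_product` and `solve` are invented for the purpose.

```python
def solve(A, rhs):
    """Solve A x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, row)) + [float(r)] for row, r in zip(A, rhs)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(n):
            if r != i:
                f = M[r][i] / M[i][i]
                M[r] = [x - f * y for x, y in zip(M[r], M[i])]
    return [M[i][n] / M[i][i] for i in range(n)]


def mean(v):
    return sum(v) / len(v)


def mean_product(u, v):
    """p_uv: mean product of deviations from the means (sigma^2 when u is v)."""
    mu, mv = mean(u), mean(v)
    return sum((s - mu) * (t - mv) for s, t in zip(u, v)) / len(u)


# Hypothetical observations with X1 = 2 + 1*X2 + 0.5*X3 - 1*X4 exactly.
X2 = [1, 2, 3, 4, 5, 6]
X3 = [2, 1, 4, 3, 6, 5]
X4 = [1, 0, 0, 1, 2, 0]
X1 = [2 + u + 0.5 * v - w for u, v, w in zip(X2, X3, X4)]

others = [X2, X3, X4]
# Three reduced normal equations in the mean products and mean squares.
P = [[mean_product(u, v) for v in others] for u in others]
rhs = [mean_product(X1, u) for u in others]
b = solve(P, rhs)

# The constant a from A1 = a + A2*b12.34 + A3*b13.24 + A4*b14.23.
a = mean(X1) - sum(bi * mean(u) for bi, u in zip(b, others))
print(a, b)
```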

SOLUTION OF THE NORMAL EQUATIONS

The solution of the normal equations is not a difficult
. . . to the economic statistician.
If there are only two or three unknowns the corresponding
number of normal equations may be solved by
simple algebraic methods. Even with three equations
it is advisable to employ a systematic procedure;
with more than three equations this is imperative.
Such systematic methods of solving the simultaneous
equations in connection with the method
of least squares have been worked out by Gauss and by
Doolittle. The latter method, which is perhaps the more
convenient for general usage, is demonstrated below.

The coefficients of the unknowns in the normal equations
are always symmetrical with respect to the principal
diagonal. Thus in securing the most probable values of the
constants in the equation

Y = aW_1 + bW_2 + cW_3 + dW_4,

we have the four normal equations

aΣ(W_1^2) + bΣ(W_1W_2) + cΣ(W_1W_3) + dΣ(W_1W_4) - Σ(W_1Y) = 0
aΣ(W_1W_2) + bΣ(W_2^2) + cΣ(W_2W_3) + dΣ(W_2W_4) - Σ(W_2Y) = 0
aΣ(W_1W_3) + bΣ(W_2W_3) + cΣ(W_3^2) + dΣ(W_3W_4) - Σ(W_3Y) = 0
aΣ(W_1W_4) + bΣ(W_2W_4) + cΣ(W_3W_4) + dΣ(W_4^2) - Σ(W_4Y) = 0

The symmetrical arrangement about the diagonal, when
Y-terms are neglected, is obvious. Starting with any term
on the principal diagonal, we have the same coefficients
directly above as to the left. Thus, above the diagonal
term in which the coefficient Σ(W_3^2) appears, we have the
coefficients Σ(W_2W_3) and Σ(W_1W_3). The same coefficients
are found to the left of the given diagonal term, and on
the same line. For the purposes of solution, therefore,
the terms to the left of each diagonal entry may be omitted,
and we may put the remaining terms of the normal equations
in the form

aΣ(W_1^2) + bΣ(W_1W_2) + cΣ(W_1W_3) + dΣ(W_1W_4) - Σ(W_1Y)
          + bΣ(W_2^2)  + cΣ(W_2W_3) + dΣ(W_2W_4) - Σ(W_2Y)
                       + cΣ(W_3^2)  + dΣ(W_3W_4) - Σ(W_3Y)
                                    + dΣ(W_4^2)  - Σ(W_4Y).


THE DOOLITTLE METHOD

The Doolittle method may be illustrated with reference
to the following normal equations:

8.3564a + 2.790b + 2.932c + 47.967 = 0
2.790a + 6.6645b + 2.063c + 62.039 = 0
2.932a + 2.063b + 7.7893c + 47.519 = 0.

Putting these, for the purposes of the solution, in the
abbreviated form given above, we have

8.3564a + 2.790b  + 2.932c  + 47.967
        + 6.6645b + 2.063c  + 62.039
                  + 7.7893c + 47.519.

We wish to solve these for the constants a, b, and c. All
the work of computation, with the necessary checks, is
shown in the following table:

Table B

Solution of Normal Equations by the Doolittle Method

            (1)           (2)         (3)         (4)          (5)           (6)
 Line   Reciprocals        a           b           c       Known term         s

I   (1)                 8.356400    2.790       2.932       47.967        62.0454
    (2)  -.11966876    -1.000000   -.333876    -.350869     -5.740151     -7.424896 check

II  (3)                             6.6645      2.063       62.039        73.5565
    (4)                             -.931514    -.978924   -16.015030    -20.715470
    (5)                             5.732986    1.084076    46.023970     52.841030 check
    (6)  -.17442917                -1.000000    -.189094    -8.027923     -9.217017 check

III (7)                                         7.7893      47.519        60.3033
    (8)                                        -1.028748   -16.830133    -21.769807
    (9)                                         -.204992    -8.702857     -9.991922
   (10)                                         6.555560    21.986010     28.541571 check
   (11)  -.15254227                            -1.000000    -3.353796     -4.353796 check

Back Solution

       c               b               a
                  -8.027923       -5.740151
  -3.353796       + .634183       +2.468592
                  -7.393740       +1.176743
                                  -2.094816

a = -2.094816
b = -7.393740
c = -3.353796

Check:

Equation I:

8.3564a + 2.790b + 2.932c = -47.967

Substituting the given values,

8.3564(-2.094816) + 2.790(-7.393740) + 2.932(-3.353796)
The coefficients of the unknown quantities,
a, b, and c, are listed in the designated columns.
The known term in each normal equation is listed in
column (5). (The sign of this known term, it should be noted,
is that which it would have when the entire expression, of
which it is one term, is equated to zero.) Column s is
employed as a check. The value in column s, in each of
the lines I, II, and III, is the algebraic sum of the known
values in the given normal equation. In securing this
sum the coefficients to the left of the diagonal, which have
been omitted from the table as it stands, must be included.

The following is a summary of the procedure in solving 
the normal equations: 

1. In line (1) write normal equation I. 

2. In line (2), column (1), write the reciprocal of the value in
line (1), column (2), with sign changed. (This is the reciprocal of the
coefficient of a.) Multiply each item in line (1) by this reciprocal,
entering the products in the corresponding columns in line (2).
[The algebraic sum of the items in columns (2), (3), (4), and (5) of
line (2) should equal the value in column (6).] This operation has
eliminated the unknown a, by expressing it in terms of b and c.
[The -1 in line (2), column (2), has been included only to
facilitate the checking process. The same is true in lines (6) and (11).]
A heavy line may be drawn across the table below line (2).

3. Write normal equation II in line (3).

4. Multiply by the coefficient of b in line (2) (i.e., -.333876)
the items in columns (3), (4), (5), and (6) in line (1). Enter the
products in the corresponding columns of line (4).

5. Add lines (3) and (4), entering the sums in line (5). [The
algebraic sum of the items in columns (3), (4), and (5) of line
(5) should equal the value in column (6).]

6. In column (1), line (6), enter the reciprocal of the value in
column (3), line (5), reversing the sign. Multiply each term in line
(5) by this reciprocal, entering the products in line (6). [The sum
of the items in columns (3), (4), and (5) of line (6) should equal the
value in column (6).] This operation has eliminated the unknown b,
by expressing it in terms of c. A heavy line may be drawn across
the table below line (6).

7. Write normal equation III in line (7). 

8. Multiply by the coefficient of c in line (2) (i.e., -.350869)
the items in columns (4), (5), and (6) of line (1). Enter the products
in the corresponding columns of line (8).

9. Multiply by the coefficient of c in line (6) (i.e., -.189094)
the items in columns (4), (5), and (6) of line (5). Enter the products
in the corresponding columns of line (9).

10. Add lines (7), (8), and (9), entering the sums in line (10).
[The algebraic sum of the items in columns (4) and (5) of line
(10) should equal the value in column (6).]

11. In column (1), line (11), enter the reciprocal of the value in
column (4) of line (10), reversing the sign. Multiply each term in
line (10) by this reciprocal, entering the products in line (11).
[The algebraic sum of the items in columns (4) and (5) of line
(11) should equal the value in column (6).] This operation gives
the value of c, which is found in column (5) of line (11). A heavy
line may be drawn across the table below line (11).

Were there additional unknowns, as d and e, this last
operation would have given c as a function of d and e and
it would be necessary to carry the process still further,
repeating the steps taken above. The next operation would
be to bring down the fourth normal equation, entering it
in line (12). Then the coefficients of d in lines (2), (6), and
(11) would be used to multiply the necessary items in
lines (1), (5), and (10), the products being entered in lines
(13), (14), and (15). The sum of the items in lines (12),
(13), (14), and (15) would be entered in line (16) and
checked by the item in the s column. Multiplying through
by the reciprocal of the coefficient of d in line (16), with
sign reversed, the value of d would be obtained in terms
of e. The value of e would be derived in a similar fashion.

The checks on these various operations have been
indicated in the table. The testing of the results at each step
reduces the possibility of error to a minimum.

The back solution presents no difficulties. We have,
from line (11),

c = -3.353796,

from line (6)

b = -.189094c - 8.027923,

from line (2)

a = -.333876b - .350869c - 5.740151.

[The items in column (6) are inserted merely as checks.
The items -1.000000 which appear in lines (2), (6), and
(11) are inserted to assist in the checking.]

The steps in the back solution appear in the lower section
of Table B.

A final check is afforded by inserting the values secured
by this process in one of the normal equations. This check,
as carried out for equation I, is shown below the table.
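
The tabular procedure may be sketched in code as follows. This is an illustrative reading of the steps summarized above, not a transcription of any published routine; the function name `doolittle` and its internal names are hypothetical. Full rows are carried, including the coefficients to the left of the diagonal, so that the check column s can be tested at each stage, and the equations are taken in the form "coefficients times unknowns plus known term = 0", as in Table B.

```python
def doolittle(A, k, tol=1e-9):
    """Solve sum_j A[i][j]*x[j] + k[i] = 0 for symmetric A, following the
    forward elimination and back solution of the Doolittle table."""
    n = len(A)
    # Each row: coefficients, known term, and check sum s of all of them.
    rows = [list(A[i]) + [k[i]] for i in range(n)]
    rows = [r + [sum(r)] for r in rows]

    pivots, divided = [], []        # lines (1), (5), (10) and (2), (6), (11)
    for i in range(n):
        cur = rows[i][:]
        for piv, div in zip(pivots, divided):
            m = div[i]              # coefficient of unknown i in a divided line
            cur = [c + m * p for c, p in zip(cur, piv)]
        # Check: items from the diagonal through the known term must sum to s.
        if abs(sum(cur[i:n + 1]) - cur[n + 1]) > tol:
            raise ValueError("check column disagrees in equation %d" % (i + 1))
        recip = -1.0 / cur[i]       # reciprocal of the diagonal, sign changed
        pivots.append(cur)
        divided.append([c * recip for c in cur])

    # Back solution, starting from the last unknown.
    x = [0.0] * n
    for i in reversed(range(n)):
        d = divided[i]
        x[i] = d[n] + sum(d[j] * x[j] for j in range(i + 1, n))
    return x


# The normal equations used in the text's illustration.
A = [[8.3564, 2.790, 2.932],
     [2.790, 6.6645, 2.063],
     [2.932, 2.063, 7.7893]]
k = [47.967, 62.039, 47.519]

a, b, c = doolittle(A, k)
print(round(a, 4), round(b, 4), round(c, 4))
```

Because the multipliers are read from the divided lines already in the table, no coefficient below the diagonal need be recomputed; this economy depends on the symmetry of the normal equations.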
DERIVATION OF FORMULAS FOR MEAN AND
STANDARD DEVIATION OF THE BINOMIAL
DISTRIBUTION¹

For convenience we put the binomial in the form (q + p)^n,
where q = probability of a failure, p = probability of a
success, and q + p = 1. Expanding the binomial, we have

(q + p)^n = q^n + nq^(n-1)p + [n(n-1)/(1·2)]q^(n-2)p^2
          + [n(n-1)(n-2)/(1·2·3)]q^(n-3)p^3 + . . . + p^n.

The terms of this expansion indicate, in order, the probable
frequencies of no successes, 1 success, 2 successes, 3
successes, and so on, to n successes. A frequency table of the
familiar type may be constructed from these materials.

The items in col. (2) of Table C constitute the terms of
the binomial expansion. Their sum is thus equal to (q + p)^n,
which is, by definition, equal to 1. The items in col. (3),
added in order, give

nq^(n-1)p + n(n-1)q^(n-2)p^2 + [n(n-1)(n-2)/(1·2)]q^(n-3)p^3 + . . . + np^n.

Since the factors n and p appear in each of these terms, this
reduces to

np[q^(n-1) + (n-1)q^(n-2)p + [(n-1)(n-2)/(1·2)]q^(n-3)p^2 + . . . + p^(n-1)].

¹ These derivations are adapted from the proof given by D. C. Jones in
A First Course in Statistics, London, Bell & Sons, 1921, 143-145.

But the terms within brackets, following np, represent the
expansion of the binomial (q + p)^(n-1). Since q + p = 1,
the sum of these terms is 1. Accordingly the sum of the
items in col. (3) reduces to

np(q + p)^(n-1) = np.

For the mean of this distribution we have

M = np/1 = np.

Adding the items in col. (4) in order, we have

nq^(n-1)p + 2n(n-1)q^(n-2)p^2 + [3n(n-1)(n-2)/(1·2)]q^(n-3)p^3
    + [4n(n-1)(n-2)(n-3)/(1·2·3)]q^(n-4)p^4 + . . . + n^2p^n

= np[q^(n-1) + 2(n-1)q^(n-2)p + [3(n-1)(n-2)/(1·2)]q^(n-3)p^2
    + [4(n-1)(n-2)(n-3)/(1·2·3)]q^(n-4)p^3 + . . . + np^(n-1)].

The terms within brackets may be broken into two groups,
giving

np{[q^(n-1) + (n-1)q^(n-2)p + [(n-1)(n-2)/(1·2)]q^(n-3)p^2
    + [(n-1)(n-2)(n-3)/(1·2·3)]q^(n-4)p^3 + . . . + p^(n-1)]
  + [(n-1)q^(n-2)p + [2(n-1)(n-2)/(1·2)]q^(n-3)p^2
    + [3(n-1)(n-2)(n-3)/(1·2·3)]q^(n-4)p^3 + . . . + (n-1)p^(n-1)]}.

The terms within the first of these two groups constitute
the expansion of the binomial (q + p)^(n-1). These terms
may be replaced by that binomial; the second group of
terms may be simplified, since they contain the common
factor (n-1)p. We have, then,

np{(q + p)^(n-1) + (n-1)p[q^(n-2) + (n-2)q^(n-3)p
    + [(n-2)(n-3)/(1·2)]q^(n-4)p^2 + . . . + p^(n-2)]}.

The second group of terms, thus simplified, is seen to be
(n-1)p multiplied by the expansion of the binomial
(q + p)^(n-2). We have, then, as the sum of the items in
col. (4) of the preceding table,

np[(q + p)^(n-1) + (n-1)p(q + p)^(n-2)].

But since q + p = 1, (q + p)^(n-1) = 1 and (q + p)^(n-2) = 1.
Accordingly, the total of col. (4) becomes

np[1 + p(n-1)].

As a general formula for the standard deviation, in
squared form, we have

σ^2 = [Σ(f × squared deviation from the arbitrary origin)]/N - c^2

where c is the difference between the mean of the distribution
and the arbitrary origin. In the present instance,
the origin is at 0, or "no successes," and c is equal to the
mean, or np. N is equal to Σ(f), or 1, in this case. Thus
the standard deviation of the binomial distribution is given by

σ^2 = np[1 + p(n-1)] - n^2p^2
    = np[np + (1-p)] - n^2p^2
    = n^2p^2 + np(1-p) - n^2p^2
    = np(1-p)
    = npq

σ = √(npq).
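
The results M = np and σ^2 = npq, and the intermediate total np[1 + p(n-1)] for col. (4), may be verified exactly for any particular n and p. The sketch below is an illustration only; the function name `binomial_columns` and the values n = 10, p = 1/4 are arbitrary choices, and exact fractions are used so that the comparisons involve no rounding.

```python
from fractions import Fraction
from math import comb


def binomial_columns(n, p):
    """Return the totals of col. (2), col. (3) and col. (4): the sums of
    f, f*m and f*m^2 over m = 0, 1, ..., n successes."""
    q = 1 - p
    f = [comb(n, m) * q ** (n - m) * p ** m for m in range(n + 1)]
    col2 = sum(f)
    col3 = sum(m * fm for m, fm in enumerate(f))
    col4 = sum(m * m * fm for m, fm in enumerate(f))
    return col2, col3, col4


n, p = 10, Fraction(1, 4)          # arbitrary illustrative values
q = 1 - p
col2, col3, col4 = binomial_columns(n, p)

mean = col3 / col2                 # origin at zero successes
variance = col4 / col2 - mean ** 2
print(col2, mean, col4, variance)
```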
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DERIVATION OF THE STANDARD ERROR 
OF THE ARITHMETIC MEAN 

We have made n random, hence independent, observations 
on a given variable. The respective observations may be 
represented by $X_1, X_2, X_3, \dots, X_n$. Representing the sum of 
the n observations by W, we have 

$$W = X_1 + X_2 + X_3 + \dots + X_n. \qquad (1)$$

Additional samples are now taken until we have N values of 
$X_1$, N values of $X_2$, etc., and hence N values of the sum, W. 
We have N samples, therefore, of n observations each. The 
mean values, which we may represent by barred letters, 
stand in the same relationship of equality: 

$$\bar{W} = \bar{X}_1 + \bar{X}_2 + \bar{X}_3 + \dots + \bar{X}_n. \qquad (2)$$

Using small letters ($w$, $x_1$, $x_2$, etc.) to define deviations of 
the actual observations from these mean values, we may 
write, for any given sample, or series of observations, 

$$w = x_1 + x_2 + x_3 + \dots + x_n. \qquad (3)$$

Squaring the two sides of this equation, we have 

$$\begin{aligned}
w^2 = {} & x_1^2 + x_2^2 + x_3^2 + \dots + x_n^2 + 2x_1x_2 + 2x_1x_3 + \dots \\
 & + 2x_1x_n + 2x_2x_3 + \dots + 2x_2x_n + \dots \\
 & + 2x_3x_n + \dots \qquad (4)
\end{aligned}$$

Each term on the right-hand side of (3) will appear in squared 
form in (4), and there will also appear product terms of the 
form $2x_1x_2$ corresponding to all possible pairings of the terms 
on the right-hand side. 

The next step involves the summation of the equations 
of type (4), derived from the N samples, and division 
throughout by N. Each product term, when thus summed 
and divided by N, will be of the form 

$$\frac{2\Sigma x_1x_2}{N}.$$

This, with the modification introduced by the factor 2, 
resembles the familiar mean product, encountered in 
correlation procedure. This mean product, we have seen, has a 
value of zero when the variables x and y are uncorrelated. 
But, by hypothesis, the observations that have given us 
$x_1$, $x_2$, $x_3$, etc., are independent of one another, and hence 
these variables are uncorrelated. Accordingly, each of the 
product terms, derived when N equations corresponding 
to (4) above are summed and divided by N, is equal to zero. 
The process of summation and division gives us, therefore, 

$$\frac{\Sigma w^2}{N} = \frac{\Sigma x_1^2}{N} + \frac{\Sigma x_2^2}{N} + \frac{\Sigma x_3^2}{N} + \dots + \frac{\Sigma x_n^2}{N} \qquad (5)$$

or 

$$\sigma_w^2 = \sigma_1^2 + \sigma_2^2 + \sigma_3^2 + \dots + \sigma_n^2. \qquad (6)$$


If all the observations relate to the same universe (i.e., if 
the samples are all drawn from the same parent population), 
which is true, by hypothesis, the standard deviations appear- 
ing in the right-hand member of equation (6) are equal to 
one another and to the standard deviation of the population. 
Accordingly, using $\sigma$ to represent that standard deviation, 
we have 

$$\sigma_w^2 = n\sigma^2. \qquad (7)$$
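
Relation (7) can be verified exactly on a small artificial universe. The sketch below enumerates every possible sample of n = 3 independent draws from the six faces of a die; this universe, the sample size, and the variable names are assumptions chosen only to keep the enumeration short. Dividing each sum by n scales its variance by 1/n², so the same enumeration also gives the variance of the sample means.

```python
from itertools import product
from statistics import pvariance

# Hypothetical universe: the six faces of a die, each equally likely.
population = [1, 2, 3, 4, 5, 6]
n = 3  # observations per sample

sigma2 = pvariance(population)  # variance of the universe

# Every possible ordered sample of n independent draws (with replacement).
samples = list(product(population, repeat=n))
sums = [sum(s) for s in samples]        # the W of each sample
means = [sum(s) / n for s in samples]   # W / n for each sample

var_sum = pvariance(sums)     # should equal n * sigma^2, as in equation (7)
var_mean = pvariance(means)   # should equal sigma^2 / n

print(sigma2, var_sum, var_mean)
```

Because all 216 samples are enumerated, the agreement is exact apart from floating-point rounding, rather than approximate as it would be in a random simulation.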

The next argument, that leads directly to the desired 
measurement, follows precisely these steps, which have been 
given in the above form to indicate the reasoning involved. 
It starts, however, with a variant form of equation (3). 
Dividing that equation throughout by n, we have 

$$\frac{w}{n} = \frac{x_1}{n} + \frac{x_2}{n} + \frac{x_3}{n} + \dots + \frac{x_n}{n}. \qquad (8)$$

Working with the variables $\frac{w}{n}$, $\frac{x_1}{n}$, $\frac{x_2}{n}$, etc., just as we have 
done with $w$, $x_1$, $x_2$, etc., we may go through the operations 
represented by equations (4), (5), and (6), above. The 
product terms disappear, as in passing from (4) to (5). In 
the process of squaring the term $\frac{w}{n}$ is treated as an entity; 
the sum of the squared values is thus $\Sigma\left(\frac{w}{n}\right)^2$. Numerator and 
denominator of each of the terms of type $\frac{x_1}{n}$ are squared 
separately, however, and the sum is of the form $\frac{\Sigma x_1^2}{n^2}$. Division 
throughout by N then gives the quantities appearing in 
equation (9), which corresponds to equation (6). 

$$\sigma_{w/n}^2 = \frac{\sigma_1^2}{n^2} + \frac{\sigma_2^2}{n^2} + \frac{\sigma_3^2}{n^2} + \dots + \frac{\sigma_n^2}{n^2}. \qquad (9)$$

Since all observations relate to the same universe, this 
reduces to 

$$\sigma_{w/n}^2 = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}.$$

From this 

$$\sigma_{w/n} = \frac{\sigma}{\sqrt{n}}.$$

But w is the sum of n observations drawn from a universe 
having a standard deviation of $\sigma$, and $\frac{w}{n}$ is the mean of these 
observations. $\sigma_{w/n}$ is the standard deviation of a distribution 
of arithmetic means, corresponding to the familiar symbol 
$\sigma_M$. This is the desired expression for the standard error 
of the arithmetic mean, appropriate for use when the $\sigma$ 
of the population is known. Where $\sigma$ is estimated from the 
standard deviation of a sample, accuracy is increased by 
ILLUSTRATING THE MEASUREMENT OF TREND 
BY A MODIFIED EXPONENTIAL CURVE, A GOMPERTZ 
CURVE AND A LOGISTIC CURVE 

The discussion in Chapter VII of mathematical functions 
suitable for use in measuring the secular trends of time series 
dealt with types required in ordinary practice. We here 
discuss briefly three other types suited to the measurement 
of long-term movements in economic and business series. 

The Modified Exponential Curve 
An exponential curve, which plots as a straight line on 
ratio paper, is a suitable measure of trend for a series that 
is increasing or decreasing at a constant rate, that is, one 
that shows constancy of relative growth. The figures defin- 
ing the successive trend values of a series of this type con- 
stitute a geometric progression. The trends of certain eco- 
nomic series that depart from constancy of relative growth 
may be accurately defined by a simple modification of the 
exponential curve. This is the case when the observed 
values may be transformed, by the addition (or subtraction) 
of a constant magnitude, to a series closely approximating 
such a geometric progression. 

If we represent by K the constant magnitude that is to 
be added (algebraically) to each observed value in effecting 
the desired transformation, the task of fitting the trend line 
involves the following steps: 

1. Determination of K. 
2. Correction of observed values by K, to obtain the modified series. 
3. Fitting an exponential curve to the modified series, and 
   computation of trend values of the modified series. 
4. Correction of trend values of the modified series by K to obtain 
   trend values of original series. 

If y represents the ordinates of trend of the original series 
and x represents time, the equation to the desired line of 
trend may be put in the form 

$$y = ab^x - K$$

where K is the correction factor noted above and a and b 
are constants to be determined by fitting an exponential 
curve to the modified series. The procedure may be 
illustrated with reference to the data in Table D. 

Table D 

Illustrating the Fitting of a Modified Exponential Curve 
Production of Rayon Filament Yarn in the United States, 1920-1931 
(Data in thousands of pounds) 

(1)    (2)          (3)               (4)        (5)            (6)
Year   Original     Group             Modified   Trend values,  Trend values,
       series       mean              series     modified       original
       (observed)                     (2) + K    series         series
                                                                (5) - K

1920    10,125                         27,669     29,108         11,564
1921    14,986      M1 =               32,530     34,363         16,819
1922    24,067       21,034.25         41,611     40,565         23,021
1923    34,959                         52,503     47,888         30,344

1924    36,328                         53,872     56,532         38,988
1925    51,049      M2 =               68,593     66,736         49,192
1926    62,693       56,406.25         80,237     78,782         61,238
1927    75,555                         93,099     93,003         75,459

1928    97,232                        114,776    109,790         92,246
1929   121,399      M3 =              138,943    129,608        112,064
1930   127,333      124,210.75        144,877    153,003        135,459
1931   150,879                        168,423    180,621        163,077



In employing this method we approximate K empirically 
by breaking the observed series into three parts, representing 
equal periods of time, and determining the mean of the 
observations for each period. We may designate these means, 
in chronological order, by $M_1$, $M_2$, and $M_3$. The desired 
value, K, is given by 

$$K = \frac{M_2^2 - M_1M_3}{M_1 + M_3 - 2M_2}.$$

If the observed series constitute a geometric progression 
the value of K will be zero; if the addition of a constant 
magnitude to the members of the original series will yield 
a series approximating a geometric progression, K will be 
positive; if the subtraction of a constant amount from the 
observed values will yield a series approximating a geometric 
progression, K will be negative. (In practice, K is given 
the sign obtained by the employment of the method de- 
scribed above, and then added algebraically to the observed 
series.) 

In the present case we have 

$$K = \frac{(56{,}406.25)^2 - (21{,}034.25 \times 124{,}210.75)}{(21{,}034.25 + 124{,}210.75) - (2 \times 56{,}406.25)} = +17{,}544.$$

Adding this amount to each of the values recorded in 
col. (2) of Table D, we obtain the modified series in col. (4). 

In fitting an exponential curve to the modified series, it is 
desirable to use logarithms, that is, to solve the constants 
in an equation of the type $\log y = \log a + (\log b)x$. This 
procedure was explained in Chapter VII. For log a of this 
curve we obtain 4.824359, and for log b .072068. (The origin 
is at 1925.) The antilogarithms of the series of trend values 
thus obtained are given in col. (5). These define the trend of 
the modified series. Subtracting K (algebraically) from these 
values we obtain the trend values of the original series, which 
appear in col. (6). 
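
The four steps can be retraced in a few lines of code. The sketch below is only an illustration under stated assumptions: the function names are invented here, K is rounded to the nearest whole number as in the text, and the exponential curve is fitted by ordinary least squares on the common logarithms of the modified series, which is assumed to be the Chapter VII procedure referred to above.

```python
from math import log10

# Col. (2) of Table D: rayon filament yarn production, thousands of pounds.
years = list(range(1920, 1932))
production = [10125, 14986, 24067, 34959, 36328, 51049,
              62693, 75555, 97232, 121399, 127333, 150879]


def group_means(values):
    """Means of the three equal chronological groups (M1, M2, M3)."""
    size = len(values) // 3
    return [sum(values[i * size:(i + 1) * size]) / size for i in range(3)]


def correction_constant(values):
    """K = (M2^2 - M1*M3) / (M1 + M3 - 2*M2)."""
    m1, m2, m3 = group_means(values)
    return (m2 ** 2 - m1 * m3) / (m1 + m3 - 2 * m2)


def fit_log_line(xs, ys):
    """Least-squares fit of log10(y) = log_a + log_b * x (assumed procedure)."""
    logs = [log10(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (l - mean_log) for x, l in zip(xs, logs))
    log_b = sxy / sxx
    log_a = mean_log - log_b * mean_x
    return log_a, log_b


K = round(correction_constant(production))         # step 1
modified = [y + K for y in production]             # step 2, col. (4)
xs = [year - 1925 for year in years]               # origin at 1925
log_a, log_b = fit_log_line(xs, modified)          # step 3
trend_modified = [10 ** (log_a + log_b * x) for x in xs]   # col. (5)
trend_original = [t - K for t in trend_modified]           # step 4, col. (6)

print(K, round(log_a, 6), round(log_b, 6))
```

Small differences in the last digits of the trend values, compared with cols. (5) and (6), are to be expected from rounding in the printed logarithms.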

The original series measuring production of rayon filament 
yarn and the modified exponential curve fitted to this 
series are shown graphically in Fig. A. 

Fig. A. Total Production of Rayon Filament Yarn in the United States, 
1920-1931, with Modified Exponential Trend 

It is essential that the three M’s used in the determination 
of K relate to equal numbers of observations and that 
the midpoints, in time, of the three periods be equidistant. 
In the above example the number of years included in the 
period is a multiple of three, and no difficulty arises. If the 
number of years included is not a multiple of three, intervals 
that overlap slightly may be employed. For example, if our 
series had run from 1920 to 1932, the three averages might 
have been derived, respectively, from the five-year periods 
1920-1924, 1924-1928, 1928-1932. These would center, 
respectively, at 1922, 1926, and 1930, and would thus be 
equidistant in time from one another. Alternatively, if monthly 
data are available, division of the total period into three 
equal parts may be facilitated by using a time-unit of 4 or 
8 months, rather than 12 months. 

The Gompertz Curve 

The Gompertz curve, which has important uses in actu- 
arial science, has had some application in the study of eco- 
nomic and business trends. The term “growth curve” is 
applicable to it, since it portrays a process of cumulative 
expansion to a maximum value. This expansion proceeds 
by decreasing absolute increments in the later stages, but 
continues to the end without retrogression. It may not be 
assumed that this form of growth is typical of all industrial 
development, but the curve has value as an empirical rep- 
resentation of certain trend movements. 

For the purpose of fitting, the equation to the curve is 
transformed from the natural form 

$$y = ab^{c^x}$$

to the logarithmic form 

$$\log y = \log a + (\log b)c^x.$$

When fitted to an appropriate set of observations, measur- 
ing the expansion of an industry or the growth of an eco- 
nomic element, log a is the logarithm of the maximum value 
— the ceiling that the curve approaches. The second term 
measures the amount by which the trend value at a given 
time falls short of this maximum, an amount that diminishes, 
of course, with the passage of time. (The series for which 
this curve is an appropriate measure of trend will be 
expanding by decreasing amounts in the later stages of the 
period covered, and c, derived in the manner indicated below, 
will have a value between zero and unity.) The origin on 
the x-scale (time) is taken at the year to which the first 
entry relates. 

The method employed in fitting this curve is an approxi- 
mative one, since the least squares procedure in customary 
form is not applicable. Here, as in the preceding example, 
the series is broken into three equal portions. The sum of 
the logarithms of the observations in each of these segments 
is obtained; from these sums, and the differences between 
them, the necessary constants may be computed. The 
method is illustrated with reference to the data of rayon pro- 
duction for the years 1920-1937, which appear in Table E. 

Table E 

Computation of Quantities Required in the Fitting of a 
Gompertz Curve 
Production of Rayon Filament Yarn in the United States, 1920-1937 
(Data in thousands of pounds) 

Year   Rayon          Log y        Sub-totals        First
       production                                    differences
       y

1920    10,125        4.0053950
1921    14,986        4.1756857
1922    24,067        4.3814220
1923    34,959        4.5435590    S1 =
1924    36,328        4.5602415    26.374290
1925    51,049        4.7079872

1926    62,693        4.7972191                      d1 = S2 - S1
1927    75,555        4.8782632                         = 3.656786
1928    97,232        4.9878092
1929   121,399        5.0842151    S2 =
1930   127,333        5.1049409    30.031076
1931   150,879        5.1786288

1932   134,670        5.1292709                      d2 = S3 - S2
1933   213,498        5.3293938                         = 2.095138
1934   208,321        5.3187331
1935   257,557        5.4108734    S3 =
1936   277,626        5.4434601    32.126214
1937   312,236        5.4944829

We may use n to define the number of terms entering 
into each of the three sub-totals (in the present example 
n = 6); the sub-totals are represented, in chronological 
order, by $S_1$, $S_2$, and $S_3$; the first differences between the 
sub-totals are represented by $d_1$ and $d_2$.¹ We use these quantities 
in solving for the three constants c, log b and log a. The 
general relations from which these values are determined 
are the following: 

$$c^n = \frac{d_2}{d_1}$$

$$\log b = \frac{d_1(c - 1)}{(c^n - 1)^2}$$

$$\log a = \frac{1}{n}\left[S_1 - \frac{d_1}{c^n - 1}\right]$$

Inserting the proper quantities, we have 

$$c^6 = \frac{2.095138}{3.656786} = .572945$$

$$c = \sqrt[6]{.572945} = .911351$$

$$\log b = \frac{3.656786 \times (-.088649)}{(.572945 - 1)^2} = -1.777493$$

$$\log a = \frac{1}{6}\left[26.374290 - \frac{3.656786}{.572945 - 1}\right] = 5.822848.$$

The required equation is, therefore, 

$$\log y = 5.822848 - 1.777493(.911351^x)$$

in which x relates to deviations from an origin at the position 
of the first term. 
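
The three-group computation can be sketched in code as follows. This is an illustration only: the function and variable names are invented here, the production figures are those of Table E, and the relations used for c, log b and log a are the ones implied by the numerical substitution shown above.

```python
from math import log10

# Rayon production, 1920-1937, thousands of pounds (Table E).
production = [10125, 14986, 24067, 34959, 36328, 51049,
              62693, 75555, 97232, 121399, 127333, 150879,
              134670, 213498, 208321, 257557, 277626, 312236]


def fit_gompertz(values):
    """Fit log10(y) = log_a + log_b * c**x by the three-sub-total method.

    x = 0 at the first observation; the series is split into three equal
    chronological groups of n terms each.
    """
    n = len(values) // 3
    logs = [log10(v) for v in values]
    s1, s2, s3 = (sum(logs[i * n:(i + 1) * n]) for i in range(3))
    d1, d2 = s2 - s1, s3 - s2            # first differences of sub-totals
    c_n = d2 / d1                        # c raised to the n-th power
    c = c_n ** (1 / n)
    log_b = d1 * (c - 1) / (c_n - 1) ** 2
    log_a = (s1 - d1 / (c_n - 1)) / n
    return log_a, log_b, c


log_a, log_b, c = fit_gompertz(production)


def trend(x):
    """Trend ordinate (thousands of pounds) x years after 1920."""
    return 10 ** (log_a + log_b * c ** x)


ceiling = 10 ** log_a   # the value the curve approaches as x grows
print(round(log_a, 6), round(log_b, 6), round(c, 6), round(ceiling))
```

Evaluating `trend(x)` for x = 0, 1, 2, ... gives the ordinates of the fitted curve, and `ceiling` is the antilogarithm of log a, roughly 665,000 thousand pounds for these data.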

Substituting in this trend equation the values of x given 
in Table F, logarithms of the trend values are obtained. The 
corresponding natural numbers define the course of the line 
of trend. The method of calculation is indicated in Table F. 

¹ The condition, previously noted, that the series to which the curve is to 
be fitted be one that is expanding by decreasing increments in the later 
stages of the period covered, is met when $d_2$ is less than $d_1$. 

Table F 

Illustrating the Computation of Ordinates of Trend of a Gompertz 
Curve Fitted to Data of Rayon Production, 1920-1937 

(1)     (2)    (3)          (4)             (5)           (6)
Year    x      c^x          (log b)c^x      log y         y
                                            (4) + log a   Anti-log of (5)
                                                          (in thousands
                                                          of pounds)

1920     0     1.000000     -1.777493       4.045355       11,101
1921     1     0.911351     -1.619920       4.202928       15,956
1922     2     0.830560     -1.476315       4.346533       22,209
1923     3     0.756932     -1.345441       4.477407       30,020
1924     4     0.689830     -1.226168       4.596680       39,508
1925     5     0.628677     -1.117469       4.705379       50,743
1926     6     0.572945     -1.018408       4.804440       63,744
1927     7     0.522154     -0.928125       4.894723       78,474
1928     8     0.475865     -0.845847       4.977001       94,842
1929     9     0.433681     -0.770865       5.051983      112,715
1930    10     0.395235     -0.702527       5.120321      131,923
1931    11     0.360198     -0.640249       5.182599      152,265
1932    12     0.328267     -0.583492       5.239356      173,523
1933    13     0.299166     -0.531765       5.291083      195,471
1934    14     0.272645     -0.484625       5.338223      217,883
1935    15     0.248475     -0.441663       5.381185      240,539
1936    16     0.226448     -0.402510       5.420338      263,231
1937    17     0.206374     -0.366830       5.456018      285,771
1947    27     0.081566     -0.144983       5.677865      476,283
1957    37     0.032238     -0.057303       5.765545      582,834
1967    47     0.012741     -0.022647       5.800201      631,245



The original data and the Gompertz curve fitted to them 
are shown graphically in Fig. B. 

The ceiling to this curve is set by the constant a, which 
has a value of approximately 665,000,000 pounds. This 
indicates that if the extrapolation of the trend of rayon 
production from 1920 to 1937, as measured by a Gompertz 
curve, accurately defines the future course of production, 
the maximum output to be expected is 665 million pounds 
per year. (It need hardly be pointed out that this extra- 
polation involves some doubtful assumptions, and that no 
mystic significance is to be attached to it.) The years to 
which the present data relate were years of rapid expansion 
in the industry. The slackening of the rate of increase, 
which is to be expected in a mature industry, had not 
become marked by 1937. In order that the nature of the curve 
may be clear, extrapolated values for 1947, 1957, and 1967 
are given in the table, and the projection of the trend is 
shown in Fig. B. After 1947, and still more conspicuously 
after 1957, the curve shows a notable dampening in the rate 
of expansion. We may not say that the industry will actually 
follow this course. In particular, the asymptote a may be 
expected to change, as conditions affecting the industry 
and the demand for its products vary in the future. Within 
the limits of the observations, however, the Gompertz curve 
serves as a satisfactory measure of trend. 

Fig. B. Total Production of Rayon Filament Yarn in the United States, 
1920-1937, with Gompertz Trend Line Extrapolated to 1967 

The Logistic Curve 

The logistic curve, sometimes termed the Pearl-Reed 
growth curve because of the extensive use made of it in 
population studies by Raymond Pearl and L. J. Reed, 
resembles somewhat the Gompertz curve discussed above. 

Table G 

Computation of Quantities Required in the Fitting of a Logistic Curve 
to Data of Railroad Mileage Operated in the United States, 
by Five-Year Intervals, 1850-1935 

(1)     (2)          (3)              (4)             (5)
Year    Miles of     100,000,000/y    Sub-totals      First
        railroad                                      differences
        operated
        y

1850      9,021      11,085
1855     18,374       5,442
1860     30,626       3,265
1865     35,085       2,850           S1 = 25,882
1870     52,922       1,890
1875     74,096       1,350

1880     93,262       1,072                           d1 = S2 - S1
1885    128,320         779                              = -21,849
1890    156,404         639           S2 = 4,033
1895    177,746         563
1900    192,556         519
1905    216,974         461

1910    240,831         415                           d2 = S3 - S2
1915    257,569         388                              = -1,679
1920    259,941         385           S3 = 2,354
1925    258,631         387
1930    260,440         384
1935    252,930         395


It represents a modified geometric progression, the growth 
of a series that tends to decrease as it approaches some 
specified limit. Like the Gompertz curve it may be used as 
an empirical approximation to the trends of certain economic 
series. Extrapolations are subject, of course, to the same 
uncertainties that attach to projections of other empirically 
derived trend lines. 

A form of this curve adapted to use as a measure of trend 
is defined by the equation 

$$\frac{1}{y} = a + bc^x.$$
This, it will be noted, is the equation to a modified 
exponential curve, except that the dependent variable is $\frac{1}{y}$ rather 
than y. (The symbols here used for the constants differ 
somewhat from those employed in treating the modified 
exponential curve.) A method of fitting somewhat similar 
to those employed in the preceding examples may be 
employed, with necessary modifications required by the use of 
reciprocals of y. The method may be discussed with 
reference to the data of railroad mileage in Table G. 
Computations are facilitated by multiplying the reciprocals of y by 
a suitable power of 10, as is done in col. (3) of this table. 

As in the two preceding illustrations, the observations are 
divided, chronologically, into three equal groups. Group 
sub-totals and the first differences between these sub-totals 
are computed. The symbol n is used for the number of terms 
in each of these sub-groups. The origin of the x-scale (time) 
is set at the date of the first observation. The time unit 
here employed is five years. 

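The constants in the desired equation may be derived 
from the following relations. 

$$c^n = \frac{d_2}{d_1}$$

$$b = \frac{d_1(c - 1)}{(c^n - 1)^2}$$

$$a = \frac{1}{n}\left[S_1 - \frac{d_1}{c^n - 1}\right]$$

Substituting the given values, we have 

$$c^6 = \frac{-1{,}679}{-21{,}849} = .076846$$

$$c = \sqrt[6]{.076846} = .652034$$

$$b = \frac{-21{,}849(-.347966)}{(.076846 - 1)^2} = +8{,}921.14$$

$$a = \frac{1}{6}\left[25{,}882 - \frac{-21{,}849}{.076846 - 1}\right] = +369.04.$$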

These results relate to initial observations which have been 
modified by the multiplication of $\frac{1}{y}$ by 100,000,000. The 
desired equation is, therefore, 

$$\frac{100{,}000{,}000}{y} = 369.04 + 8{,}921.14(.652034^x)$$



where x measures deviations in five-year units from an origin 
at 1850. 
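
The logistic fit follows the same pattern, applied to the scaled reciprocals of y. The sketch below is an illustration under stated assumptions: the names are invented here, the mileage figures are those of Table G, the reciprocals are rounded to whole numbers as in col. (3) of that table, and the relations for a, b and c are the ones implied by the substitution shown above.

```python
SCALE = 100_000_000  # multiplier applied to 1/y, as in col. (3) of Table G

# Miles of railroad operated, five-year intervals 1850-1935 (Table G).
mileage = [9021, 18374, 30626, 35085, 52922, 74096,
           93262, 128320, 156404, 177746, 192556, 216974,
           240831, 257569, 259941, 258631, 260440, 252930]


def fit_logistic(values, scale=SCALE):
    """Fit scale / y = a + b * c**x by the three-sub-total method.

    x is measured in five-year units from the first observation.
    """
    n = len(values) // 3
    recips = [round(scale / v) for v in values]   # rounded as in the table
    s1, s2, s3 = (sum(recips[i * n:(i + 1) * n]) for i in range(3))
    d1, d2 = s2 - s1, s3 - s2
    c_n = d2 / d1
    c = c_n ** (1 / n)
    b = d1 * (c - 1) / (c_n - 1) ** 2
    a = (s1 - d1 / (c_n - 1)) / n
    return a, b, c


a, b, c = fit_logistic(mileage)


def trend(x):
    """Trend value of mileage, x five-year units after 1850."""
    return SCALE / (a + b * c ** x)


upper_limit = SCALE / a   # value approached as c**x tends to zero
print(round(a, 2), round(b, 2), round(c, 6), round(upper_limit))
```

Since c lies between zero and unity, the term b·c^x dies away with time and the fitted mileage approaches `upper_limit`.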

Succeeding calculations are shown in Table H. 


Table H 

Computation of Ordinates of Trend of Logistic Curve Fitted to Data 
of Railroad Mileage 

(1)     (2)    (3)          (4)        (5)              (6)
Year    x      c^x          bc^x       100,000,000/y    y
                                       (a + bc^x)       (100,000,000 ÷ (5))

1850     0     1.000000     8,921      9,290             10,764
1855     1      .652034     5,817      6,186             16,166
1860     2      .425148     3,793      4,162             24,027
1865     3      .277211     2,473      2,842             35,186
1870     4      .180751     1,613      1,982             50,454
1875     5      .117856     1,051      1,420             70,423
1880     6      .076846       686      1,055             94,787
1885     7      .050106       447        816            122,549
1890     8      .032671       291        660            151,515
1895     9      .021303       190        559            178,891
1900    10      .013890       124        493            202,840
1905    11      .009057        81        450            222,222
1910    12      .005905        53        422            236,967
1915    13      .003850        34        403            248,139
1920    14      .002511        22        391            255,754
1925    15      .001637        15        384            260,417
1930    16      .001067        10        379            263,852
1935    17      .000696         6        375            266,667

The process of calculation is a straightforward one. The 
reciprocals of the entries in col. (5), multiplied by 
100,000,000, yield the desired trend values given in col. (6). These 
values, with the original series, are shown graphically in 
Fig. C. 

Fig. C. Railroad Mileage Operated in the United States, by Five-Year 
Intervals, 1850-1935, with Logistic Trend 

As in the case of the Gompertz curve, the logistic is 
suitable for measuring the trend of a series that, in its later 
stages, is growing by decreasing increments. The curve 
resembles an elongated S rising from a lower asymptote of 
zero to an upper limit indicated by the constant a. Since a in 
this case refers to an equation in which the dependent 
variable is $\frac{100{,}000{,}000}{y}$, the actual asymptote is $\frac{100{,}000{,}000}{a}$. 
From the given value of a, 369.04, we derive 270,973 miles 
as the upper limit of railroad mileage in the United States. 
As is clear from the table and chart, the actual values are 
close to this indicated limit. Barring the possibility of a 
fundamental change in relevant conditions, the record and 
the curve fitted to it indicate that the era of railroad 
expansion has ended. The extrapolation is, of course, subject 
to all the reservations that attach to the projection of other 
curves. There can be no doubt that, within the limits of 
the observations, the logistic curve gives an excellent rep- 
resentation of the actual history of railroad operation in the 
United States. 


APPENDIX E 



A FURTHER APPLICATION OF VARIANCE 
ANALYSIS 


The possibilities of Fisher’s method of variance analysis 
were far from exhausted by the several examples given in 
Chapter XV. We here supplement the treatment in that 
chapter by an additional example, illustrating further tests 
that may be made with a two-fold principle of classification. 

The observations on which this example is based consist 
of relative numbers, measuring the prices of 670 commodities 
in February, 1933, with average prices in 1926 taken as 100. 
February, 1933 marked the low point of the severe price 
decline that began in 1929. The questions to which our 
tests are directed relate to the relative severity of the declines 
occurring among different classes of goods. 

The 670 price relatives (obtained from price quotations 
compiled by the U. S. Bureau of Labor Statistics) may be 
classified into those relating to perishable goods (505 in 
number) and those relating to durable goods (165 in number). 
The classification has economic significance because of differ- 
ences in the market conditions, on both supply and demand 
sides, affecting these classes of goods during a major reces- 
sion. Again, the 670 observations may be broken down into 
those relating to raw materials (134 in number) and those 
relating to manufactured goods (536 in number). Applying 
the two principles of classification jointly we obtain 4 sub- 
groups, perishable raw materials (101 in number), perish- 
able manufactured goods (404 in number), durable raw 
materials (33 in number) and durable manufactured goods 
(132 in number). It is to be noted that the ratio of the num- 
ber of perishable raw materials to the number of perishable 

manufactured goods, 101:404, is the same as the ratio 
of the number of durable raw materials to the number of 
durable manufactured goods, 33:132. It is a necessary con- 
dition of the procedure here discussed that the frequencies 
in the several sub-groups be proportional. 

Various questions relating to the significance of these 
principles of classification may be answered with reference to the 
summary figures given in Table I. 



Table I 

Measurements Relating to the Analysis of the Relative Prices of 
670 Commodities for February, 1933 
(1926 = 100) 

                 Raw materials        Manufactured goods     All

Perishable       N = 101              N = 404                Np = 505
goods            M = 41.663366        M = 62.329208          Mp = 58.196040
                 Σd² (illegible)      Σd² (illegible)        Σd² = 253,040.57

Durable          N = 33               N = 132                Nd = 165
goods            M = 65.060606        M = 75.719697          Md = 73.587879
                 Σd² (illegible)      Σd² = 31,308.63        Σd² = 46,525.97

All              Nr = 134             Nm = 536               N = 670
                 Mr = 47.425373       Mm = 65.626866         M = 61.986567
                 Σd² = 56,952.76      Σd² = 236,562.35       Σd² = 329,029.89


The entries relating to each group and sub-group define 
the number of commodities included, the mean value of the 
price relatives for February, 1933, and the sum of the 
squares of the deviations of the observations in that group 
from the mean of that group. Thus for perishable raw ma- 



A TEST OF THE PERISHABLE-DURABLE PRINCIPLE OF 
CLASSIFICATION 

We may first ask whether the application of the two 
basic principles of classification, considered separately, gives 
groups showing significant differences in their price changes 
from 1926 to February, 1933. Examining the results of the 
perishable-durable distinction, we note that durable goods, 
with an average of 73.587879, show smaller price declines 
than perishable goods, for which the average is 58.196040. 
(Six decimal places are retained in the averages because these 
figures enter into later calculations.) Is the difference sig- 
nificant, or may it be attributed to chance? A test of the 
type discussed in Chapter XV provides an answer to this 
question. For the application of the test we must divide the 
total variability, 329,029.89, into a portion unaffected by 
perishable-durable differences and a portion that may be 
attributed to the play of forces directly related to this dis- 
tinction. 

The first of these portions, measuring the variability 
within classes, is derived directly from the figures in Table I. 


Variability within perishable group 
    = Σd² for that group                      = 253,040.57 
Variability within durable group 
    = Σd² for that group                      =  46,525.97 
Total variability within classes                299,566.54 

In deriving a measure of the variability between classes 
we take the deviation of each class mean from the mean of 
all the observations, square this, and weight by the number 
of observations in that class. Thus 

Σd² between perishable-durable classes 
    = [(61.986567 - 58.196040)² × 505] + [(61.986567 - 73.587879)² × 165] 
    = 29,463.31. 

A test of the significance of this classification reduces to 
the question whether the variability between classes is 
significantly greater than the variability within classes, when 
account has been taken of the number of degrees of freedom 
present in the two measures of variability. The appropriate 
z-test is shown below. 


Nature of          Degrees of    Sum of         Variance      Log_e σ²
variability        freedom       squares        σ²
                   n

Between classes      1            29,463.31     29,463.31     10.290900
Within classes     668           299,566.54        448.45      6.105804
                   669           329,029.85           Diff. =  4.185096

                                                          z = 2.09

For n₁ = 1 and n₂ = 668 the 1 per cent value of z is 
approximately .95; the present value is materially greater than 
this. The variance between classes is significantly greater 
than the variance within classes. The results are not con- 
sistent with the hypothesis that the true value of z is zero. 
There is a significant difference between the February, 1933, 
price relatives of perishable and durable goods, on the 1926 
base. This principle of classification is a significant one, 
with reference to this aspect of price behavior. 
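
The arithmetic of this test may be retraced as follows. The sketch is illustrative only: the function names are invented here, and z is taken as one half the difference between the natural logarithms of the two variances, which is the relation the table above exhibits (one half of 4.185096 is 2.09).

```python
from math import log


def between_classes_ss(grand_mean, classes):
    """Sum of squares between classes.

    `classes` is a list of (number of observations, class mean) pairs; each
    squared deviation of a class mean from the grand mean is weighted by the
    number of observations in that class.
    """
    return sum(n * (mean - grand_mean) ** 2 for n, mean in classes)


def fisher_z(ss_between, df_between, ss_within, df_within):
    """z = one half the natural log of the ratio of the two variances."""
    var_between = ss_between / df_between
    var_within = ss_within / df_within
    return 0.5 * log(var_between / var_within)


# Figures from Table I: perishable and durable goods.
grand_mean = 61.986567
classes = [(505, 58.196040), (165, 73.587879)]

ss_between = between_classes_ss(grand_mean, classes)
ss_within = 253040.57 + 46525.97          # within perishable + within durable
z = fisher_z(ss_between, 1, ss_within, 668)

print(round(ss_between, 2), round(z, 2))
```

The same two functions apply unchanged to any other single principle of classification, given the class sizes, class means and within-class sums of squares.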

A TEST OF THE RAW-MANUFACTURED PRINCIPLE OF 
CLASSIFICATION 

The test of the other main principle of classification follows 
exactly the same lines. The total variability, 329,029.86, 
is broken into a portion measuring variability within classes 
(293,515.11), with 668 degrees of freedom, and a portion 
measuring variability between the raw-manufactured classes 
(35,514.75) with 1 degree of freedom. The value of z is 
2.20; the corresponding 1 per cent value of z is .95. This 
principle of classification, also, is significant. Raw and 
manufactured goods differed significantly in degree of price 
change between 1926 and February, 1933. 


A TEST OF THE RESULTS OBTAINED FROM THE JOINT APPLICA- 
TION OF THE PERISHABLE-DURABLE AND RAW-MANU- 
FACTURED PRINCIPLES OF CLASSIFICATION 

The application of the two principles of classification dis- 
cussed above yields the 4 cells shown in Table I. We may 
ask whether the four groups thus distinguished — perishable 
raw materials, perishable manufactured goods, durable raw 
materials, and durable manufactured goods — are signifi- 
cantly different, judged with reference to the present obser- 
vations. The two essential elements of the total variability 
are derived from the figures in Table I in the manner indi- 
cated below. 


Variability within perishable raw materials group 
Variability within perishable manufactures group 
Variability within durable raw materials group 
Variability within durable manufactures group 
Total variability within cells 
A measure of variability between cells is derived as in the 
previous examples. 

Σd² between cells = [(61.986567 - 41.663366)² × 101] 
                  + [(61.986567 - 62.329208)² × 404] 
                  + [(61.986567 - 65.060606)² × 33] 
                  + [(61.986567 - 75.719697)² × 132] 
                  = 66,970.60. 

The test of significance takes the following form. 


Nature of          Degrees of    Sum of         Variance      Log_e σ²
variability        freedom       squares        σ²
                   n

Between cells        3            66,970.60     22,323.53     10.013395
Within cells       666           262,059.28        393.48      5.975035
                   669           329,029.88           Diff. =  4.038360

                                                          z = 2.02

For n₁ = 3, n₂ = 666 the 1 per cent value of z is 
approximately .67. The present value materially exceeds this. 
The conclusion is clear that the joint application of the two 
principles of classification yields sub-groups which differed 
significantly in their price movements between 1926 and 
February, 1933. 
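
The z of the test above is half the difference between the natural logarithms of the two variances. A short sketch of that arithmetic follows; the function names are illustrative, and the sums of squares and degrees of freedom are those of the test table.

```python
import math


def variance(sum_of_squares, degrees_of_freedom):
    """Variance as a sum of squares divided by its degrees of freedom."""
    return sum_of_squares / degrees_of_freedom


def z_statistic(variance_a, variance_b):
    """Half the difference of the natural logs of two variances, larger first."""
    larger = max(variance_a, variance_b)
    smaller = min(variance_a, variance_b)
    return 0.5 * (math.log(larger) - math.log(smaller))


if __name__ == "__main__":
    between = variance(66970.60, 3)      # between cells
    within = variance(262059.28, 666)    # within cells
    print(round(z_statistic(between, within), 2))
```

The value obtained is then compared with the tabled 1 per cent point of z for the appropriate n1 and n2.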

FURTHER TESTS OF THE MAIN PRINCIPLES OF CLASSIFICATION

The test applied in the preceding section does not bring 
out the most significant uses of a two-fold principle of classi- 
fication. In treating the four cells as we have, we have not 
made full use of the information we possess about them. The 
variance between cells, measured by the sum 66,970.60, with
3 degrees of freedom, represents the combined influence of
forces related to the perishable-durable principle of classifi-
cation, to the raw-manufactured principle, and to the inter-
action among forces related to these two principles. We
may apply more refined tests, and obtain more accurate 
information about the differential price behavior of com- 
modities of different types, by distinguishing the components 
of the variance between cells. This is done in Table J,
of the observations with which we are working.



                              Table J

Components of Variance among Observations Relating to Commodity
            Price Movements, 1926 to February, 1933

                           (1926 = 100)

Nature of                         Degrees of     Sum of        Variance
variability                       freedom        squares       σ²

Between perishable-durable
  classes                             1         29,463.31     29,463.31
Between raw-manufactured
  classes                             1         35,514.75     35,514.75
Interaction (residual varia-
  bility between cells)               1          1,992.54      1,992.54
Within cells ("experimental
  error")                           666        262,059.28        393.48

                                    669        329,029.88



Having these components we may test with greater ac- 
curacy than on pages 683 and 684 the significance of the two 
main principles of classification. For we now have a better 
yardstick, a better measure of the magnitude of variations 
due to the play of “chance.” The variability within cells
(variance = 393.48) is a better criterion of the magni- 
tude of sampling errors than is the variability within the 
perishable and durable classes (variance = 448.45) or the 
variability within the raw and manufactured classes (vari- 
ance = 439.39). For the variance within the four cells is 
free of the influence of forces connected with either of the 
specified principles of classification. 
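
The components of Table J may be recovered from the cell means and frequencies alone. The sketch below is one way of doing so, assuming the cell figures quoted earlier in this appendix and obtaining the interaction by subtraction, as the text does; the dictionary layout and the function names are illustrative only.

```python
# Cell means (1926 = 100) and frequencies, keyed by (durability, fabrication).
CELLS = {
    ("perishable", "raw"): (41.663366, 101),
    ("perishable", "manufactured"): (62.329208, 404),
    ("durable", "raw"): (65.060606, 33),
    ("durable", "manufactured"): (75.719697, 132),
}


def pooled(cells, keep):
    """Mean and frequency of the observations in all cells whose key passes keep()."""
    chosen = [cell for key, cell in cells.items() if keep(key)]
    n = sum(freq for _, freq in chosen)
    return sum(mean * freq for mean, freq in chosen) / n, n


def sum_of_squares(groups, grand):
    """Squared deviations of group means from the grand mean, weighted by frequency."""
    return sum(freq * (mean - grand) ** 2 for mean, freq in groups)


def components(cells):
    """Split the between-cells sum of squares into two main effects and a residual."""
    grand, _ = pooled(cells, lambda key: True)
    between_cells = sum_of_squares(cells.values(), grand)
    parts = {}
    for position, name in ((0, "perishable-durable"), (1, "raw-manufactured")):
        labels = sorted({key[position] for key in cells})
        groups = [pooled(cells, lambda key, p=position, v=label: key[p] == v)
                  for label in labels]
        parts[name] = sum_of_squares(groups, grand)
    # The interaction is what remains after both main effects are removed.
    parts["interaction"] = between_cells - sum(parts.values())
    parts["between cells"] = between_cells
    return parts


if __name__ == "__main__":
    for name, value in components(CELLS).items():
        print(f"{name}: {value:,.2f}")
```

The subtraction is legitimate here only because the cell frequencies are proportional; with disproportionate frequencies the components would not add to the between-cells total in this simple way.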

This more accurate test of the perishable-durable prin- 
ciple of classification is applied by the customary method. 


Nature of                         Degrees of     Variance      Log_e σ²
variability                       freedom        σ²

Between perishable-durable
  classes                             1         29,463.31     10.290900
Within cells                        666            393.48      5.975035

The 1 per cent value of z is approximately .95, and the
above result is clearly significant. The application of the 
perishable-durable principle of classification, under the con- 
ditions represented in Table I, yields classes of commodities 
that differed significantly in their price changes between 
1926 and February, 1933. It is important to note that raw 
and manufactured goods are present in the perishable and 
durable groups in precisely the same proportions. One fifth 
of the commodities in each group are raw materials and 
four fifths are manufactured goods. Thus behavior peculiar 
to raw materials may be expected to influence the two 
groups in precisely the same degree; the same is true of 
behavior peculiar to goods in the manufactured state. ^ It is 
necessary, for this reason, that the frequencies in the several 
classes be proportional in the application of the tests here 
discussed, when two principles of classification are jointly 
employed.*
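
The requirement of proportional frequencies can be checked directly from the cell counts. A minimal sketch, assuming the four frequencies used in this appendix (101, 404, 33 and 132); the helper names are illustrative.

```python
from fractions import Fraction

# Cell frequencies keyed by (durability, fabrication).
FREQUENCIES = {
    ("perishable", "raw"): 101,
    ("perishable", "manufactured"): 404,
    ("durable", "raw"): 33,
    ("durable", "manufactured"): 132,
}


def is_proportional(freq):
    """True when each cell frequency equals row total times column total over N."""
    rows, cols = {}, {}
    for (row, col), n in freq.items():
        rows[row] = rows.get(row, 0) + n
        cols[col] = cols.get(col, 0) + n
    total = sum(freq.values())
    # Integer cross-multiplication avoids any rounding in the comparison.
    return all(n * total == rows[row] * cols[col] for (row, col), n in freq.items())


def raw_share(freq, group):
    """Fraction of the commodities in a durability group that are raw materials."""
    raw = freq[(group, "raw")]
    return Fraction(raw, raw + freq[(group, "manufactured")])


if __name__ == "__main__":
    print(is_proportional(FREQUENCIES))
    print(raw_share(FREQUENCIES, "perishable"), raw_share(FREQUENCIES, "durable"))
```

When the check fails, the simple additive tests of this appendix do not apply and the procedures cited in the footnote are required.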

A test of the significance of the raw-manufactured prin- 
ciple of classification may be applied in the same way. 
The variance within cells is employed as yardstick, as in the 
preceding example. Here, also, proportionality is necessary, 
with raw and manufactured goods being divided in the same 
proportions into perishable and durable sub-groups. The
test reveals a significant difference in price behavior between 
raw and manufactured commodities. 

A TEST OF THE INTERACTION 

Not all the variability between cells is explained by the 
two major classifications we have just discussed. The 
residual variability between cells, or the interaction, amounts 
to 1,992.54, in terms of squared deviations (see Table J). 

^ See below, however, for a test of the significance of the interaction. 

* For a discussion of procedures appropriate to cases in which cell frequencies
are not proportional see

Yates, F. Journal of Agricultural Science, Vol. 23, 108 (1933).

Snedecor, G. W. and Cox, G. M. Iowa Agricultural Experiment Station
Bulletin 180 (1935).

This may be derived readily by subtracting from the total
variability between cells (66,970.60) the sum of the variabil-
ity between perishable-durable classes (29,463.31) and the
variability between raw-manufactured classes (35,514.75). 
The number of degrees of freedom in the interaction may be 
determined by the same process of subtraction. In the pres- 
ent instance it is 1. 

This residual variability may represent “experimental
error,” the play of the same chance forces that are measured
by the variability within cells. The residual variability 
was used, in the last example cited in Chapter XV, as a yard- 
stick defining the magnitude of fluctuations due to chance. 
It is proper to assume that this is the case when the two 
major principles of classification are quite independent of 
one another. But if these principles are correlated, the re- 
sidual variability reflects the interaction of the two prin- 
ciples of classification — the differential behavior of given 
classes of goods under the influence of forces related to the
other principle of classification. Thus it may be that the 
difference between raw perishable and manufactured perish- 
able goods is not the same as the difference between raw 
durable and manufactured durable goods. The process of 
fabrication applied to perishable goods may produce results
(in the form of price behavior) different from those produced 
when the process of fabrication is applied to durable goods. 
Perishable and durable goods may respond differently, as 
regards their price behavior, to the influence of fabrication. 
Such differential behavior of categories of goods under the 
influence of the same treatment (i.e., fabrication) is meas- 
ured by the interaction. 

If there is no such differential behavior, in a given ex- 
periment, the residual variability between cells will be of 
the same order of magnitude as the variability within cells, 
when account is taken of number of degrees of freedom. A
test is applied on page 690.

Nature of                       Degrees of    Sum of        Variance     Log_e σ²
variability                     freedom       squares       σ²

Interaction (residual vari-
  ability between cells)            1          1,922.54     1,922.54     7.561429
Within cells                      666        262,059.28       393.48     5.975035

                                                             Diff. =     1.586394
                                                                 z =     .79

If we judge this result with reference to the 1 per cent
value of z (.95), we would conclude that the residual vari-
ability between cells is attributable to the play of chance
rather than to any true interaction. For although the resid-
ual variability is greater than the variance within cells
which we use as yardstick, the excess is not clearly too great
to be attributed to chance. Reference to the 5 per cent
value of z (.675, for n1 = 1, n2 = 666) throws more light
on the situation. Less frequently than 5 times out of 100
would the play of chance alone give us a measure of resid-
ual variability as great as that here obtained. For the z
of .79 is greater than the 5 per cent value, .675. In such a
case as this, where P falls between .01 and .05, the evidence
is not conclusive. There is, however, a strong indication
that perishable and durable goods respond differently, in
their price behavior, to the process of fabrication. Reference
to Table I will show that among both perishable and dur- 
able goods fabrication appears to have reduced susceptibility 
to price decline under the force of business recession. 
is distinctly greater than Mi, and Mi is greater than M^. 
But the influence of fabrication was apparently greater 
among perishable than among durable goods. ^ Our test 
shows that the degree of difference between the two reduc- 
tions (i.e., reductions in degree of price decline) is almost 
too great to be attributed to chance. The evidence of differ- 
ential behavior is strong enough to justify further investi- 
gation. 
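
The reasoning of this section, a z falling between the 5 per cent and 1 per cent points, may be put as a small decision rule. The sketch below is illustrative only: the function names and the wording of the three verdicts are assumptions, while the tabled points (.675 and .95) are those quoted in the text.

```python
import math


def z_statistic(variance_a, variance_b):
    """Half the difference of the natural logs of two variances, larger first."""
    larger = max(variance_a, variance_b)
    smaller = min(variance_a, variance_b)
    return 0.5 * (math.log(larger) - math.log(smaller))


def verdict(z, five_per_cent_point, one_per_cent_point):
    """Place an observed z relative to the tabled 5 and 1 per cent points."""
    if z >= one_per_cent_point:
        return "significant at the 1 per cent level"
    if z >= five_per_cent_point:
        return "between the 5 and 1 per cent points: not conclusive"
    return "attributable to chance"


if __name__ == "__main__":
    within_cells = 393.48
    for interaction in (1922.54, 1992.54):
        z = z_statistic(interaction, within_cells)
        print(interaction, round(z, 2), verdict(z, 0.675, 0.95))
```

With the interaction sum of squares as entered in the test table (1,922.54) the function returns about .79; with the figure given in Table J (1,992.54) it returns about .81. Either value lies between the 5 per cent and 1 per cent points, so the verdict is the same.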

^ The statistical evidence does not, of course, yield information as to the
nature of the causal relations involved. The test here applied, if positive,
reveals the presence of interaction, but does not show how the forces involved
interact to bring about the observed differential behavior. The text is to be
read with this qualification in mind.





APPENDIX P 


GLOSSARY OF SYMBOLS 

The following are the more important symbols employed
in the preceding pages. Those of which limited use is made, 
for special purposes, are not here included. A given symbol 
is sometimes called upon to serve different purposes, but the 
precise meaning should be clear from the context. 


1. General symbols for variables and constants:

x: a variable quantity.
y: a variable quantity.

In general, any letter near the end of the alphabet may
be employed to represent a variable quantity. Different
variable quantities may be represented by the use of a
single symbol, with different subscripts, as X1, X2, X3,
or W1, W2, W3. [A distinction is later drawn (cf. Sym-
bols employed in the measurement of relationship)
between capital letters and small letters, as used to
represent variable quantities.]
a: a constant (i.e., a quantity the value of which does not 
change in the given discussion). In general, any letter 
near the beginning of the alphabet may be used to 
represent a constant. 


2. Symbols employed in the analysis and description of
the frequency distribution:

m: the value of an individual observation; the value of the
mid-point of a class. (The symbols m1, m2, m3 are some-
times employed to represent different observations in
a series.)

f: the number of observations in a given class; the frequency
of a given class.

i: the class-interval.

l: the lower limit of a class.

N: the total number of cases in a given series or frequency
distribution.

d: the deviation of a given observation from an average;
usually, a deviation from the arithmetic mean. When
written with a subscript, as dx or dy, it refers to a devia-
tion from the arithmetic mean of the variable repre-
sented by the subscript. The symbol d is sometimes
used to designate the difference between mean and
mode.

d': the deviation of a given observation from an arbitrary
origin, or assumed mean.

c: the difference between an arbitrary origin, or assumed
mean, and the true mean (in terms of the symbols ex-
plained below, c = M − M').

Σ (Sigma): the symbol for the process of summation. Thus
Σd means the sum of all the deviations.

w1, w2, w3: weights attached to a series of measures being
averaged. (Not to be confused with similar symbols
used to represent different variable quantities.)

y0: the maximum ordinate of a frequency curve.

Symbols for averages, quartiles, etc.:

M: the arithmetic mean.

Md.: the median.

Mo.: the mode.

Mg: the geometric mean.

H: the harmonic mean.

M': the value of an assumed arithmetic mean.

Q1: the first or lower quartile.

Q2: the second quartile or median.

Q3: the third or upper quartile.

K: the value of a point midway between the first and third
quartiles.

D3: the third decile.

Symbols for measures of variation and skewness:

M.D.: the mean deviation.

σ: the standard deviation; the root-mean-square deviation
about the arithmetic mean.

the standard deviation of proportions, or relative fre-
quencies.

s: the root-mean-square deviation about an origin other
than the arithmetic mean.

P.E.: the probable error.

Q.D.: the quartile deviation.

q1: the difference between the median and the lower quartile.

q2: the difference between the upper quartile and the median
(Q3 − Md.).

V: the coefficient of variation.

a measure of skewness.

χ (Chi): a measure of skewness based upon the criteria β1
and β2.


Symbols for moments and criteria of curve type.

ν1, ν2, ν3, etc.: moments of a frequency distribution about an
arbitrary origin.

π1, π2, π3, etc.: uncorrected moments of a frequency distribu-
tion about the arithmetic mean.

μ1, μ2, μ3, etc.: moments of a frequency distribution about the
arithmetic mean after the application of Sheppard's
corrections.

κ2: a criterion of curve type based on β1 and β2.


3. Symbols relating to index numbers.

p0′: price of a given commodity at time “0” (the base period).
q0′: quantity of same commodity at time “0”.
p1′: price of same commodity at time “1”.
q1′: quantity of same commodity at time “1”.
p0″: price of a second commodity at time “0”.
q0″: quantity of second commodity at time “0”.
p1″: price of second commodity at time “1”.
q1″: quantity of second commodity at time “1”.
p1/p0: a price relative (relation of price of a given commodity

4. Symbols employed in the measurement of relationship.
X: an observed value of a variable quantity.

Y: an observed value of a variable quantity. (The observed
values of different variables may be represented also by
the symbols X1, X2, X3, or W1, W2, W3.)

X̄: the arithmetic mean of a number of observed values of
the variable X. A similar symbol may be employed for
other variables. (In one demonstration in the preceding
pages, relating to multiple correlation, the symbols A1,
A2, A3 . . . are used to represent the arithmetic means
of the variables X1, X2, X3 . . . . The symbols Mx
and My are occasionally employed to designate the
arithmetic means of different variables.)

x: value of a variable quantity expressed as a deviation
from the arithmetic mean of all the observed values.
The symbol y and the symbols x1, x2, x3 . . . are
similarly employed with respect to variables repre-
sented, as to original observations, by the symbols
Y, X1, X2, X3 . . . .

X': a value of a variable quantity expressed as a deviation,
in class-interval units, from an arbitrary origin. The
symbol Y' has a similar meaning.

X'': a value of a variable quantity expressed as a deviation,
in original units, from an arbitrary origin. The sym-
bol Y'' has a similar meaning.

Yc: the computed or estimated value of a variable, as de-
termined from an equation of average relationship;
the symbol yc may be employed for such a computed
value, expressed as a deviation from the mean.

p: the mean product of two variables when expressed as
deviations from their respective arithmetic means, i.e.,
p = Σ(xy)/N. When written with subscripts, as p12, the
latter relate to the variables in question, as x1, x2.

p': the mean product of two variables when expressed as
deviations from assumed arithmetic means.

r: the Pearsonian coefficient of correlation. When written
with subscripts, the latter indicate the variables to
which the coefficient relates. Thus ryx refers to the
variables y and x, and r12 refers to the variables x1
and x2.

ρ (rho): a general index of correlation. Subscripts should be
employed to indicate the variables to which the meas-
ure relates, as ρyx, ρxy, ρ(log y)x, ρ(log y)(log x), ρ(1/y)x, etc.
(In each case the first subscript relates to the depend-
ent variable.)

ρ̄: a corrected index of correlation.

d: the deviation of a given observation from a fitted curve; 
the difference between an observed and a corresponding 
computed value of a variable. 

v: a residual; identical in meaning with d, as given above. 

S: the root-mean-square deviation about a fitted curve;
the standard error of estimate. This measure should
be written with a subscript to indicate the variable to
which it applies, as S(log y) (the standard error of
estimate in terms of logarithms), Sr (the standard error
of estimate in terms of ratios), S(1/y) (the standard error
of estimate in terms of reciprocals).

S̄: a corrected standard error of estimate.

Pr: the coefficient of rank correlation.

η (eta): the correlation ratio. Subscripts should be employed
to represent the variables to which the measure re-
lates, as ηyx or ηxy. The first subscript in each case
relates to the dependent variable.

η̄: a corrected correlation ratio.

σay: the root-mean-square deviation about a line through the
means of the columns of a correlation table; the stand-
ard deviation of the y-arrays about their respective
means. The symbol σax has the same meaning with
respect to the rows of a correlation table, or the
x-arrays.

σmy: the standard deviation of the means of the columns of
a correlation table about the mean of all the y's, the
mean of each column being weighted by the number of
items in that column. The symbol σmx has the same
meaning with respect to the means of the rows.

ζ (zeta): the test for linearity of regression (ζ = η² − r²).

m: the number of arrays employed in the computation of a 
given correlation ratio; also, the number of constants 
in the equation defining a curvilinear or multiple 
regression. 

b: the coefficient of regression; the slope of a line of regres-
sion. When written with subscripts, the latter relate
to the variables in question, as byx, b12 (for the variables
x1, x2). The first subscript relates to the dependent
variable in each case; byx is the coefficient of regression
of y on x and bxy is the coefficient of regression of x on y.

z: a logarithmic transformation of the coefficient of corre-
lation. z = ½{loge (1 + r) − loge (1 − r)}.

R1.234: the coefficient of multiple correlation between a de-
pendent variable, x1, and a combination of independent
variables, x2, x3, and x4. The order may be changed,
but the primary subscript always relates to the de-
pendent variable.

R̄1.234: a corrected coefficient of multiple correlation.

r12.34: the coefficient of partial or net correlation between the
variables x1 and x2 when the variables x3 and x4 are
held constant. The order of subscripts is changed for a
different combination of variables, the two primary
subscripts always relating to the variables between
which the net correlation is being measured.

b12.34: the coefficient of net regression between the variables
x1 and x2, the former being dependent, when the vari-
ables x3 and x4 are also taken account of in the estimat-
ing equation; the weight given to x2 in estimating x1,
when the estimate is also based upon values of x3 and
x4. The order of subscripts is changed for a different
combination of variables.

S1.234: the root-mean-square deviation about a line describing
the relationship between a dependent variable, x1, and
a series of independent variables, x2, x3, and x4; the
standard error of estimate of x1 under these conditions.

σ1.234: the standard deviation of the fourth order; identical
with S1.234.

β12.34: a coefficient of partial regression in an equation relating
to variables expressed in standard deviation units.

(In the seven measures immediately above, the number of subscripts corre-
sponds to the number of variables included in a given study. For the sake of sim-
plicity, only four variables have been assumed.)
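
The logarithmic transformation z of the coefficient of correlation, defined in the list above, is the inverse hyperbolic tangent of r, so that r may be recovered from z as the hyperbolic tangent; Appendix Table IV tabulates r for given values of z. A brief sketch using the standard library follows; the function names are illustrative.

```python
import math


def z_from_r(r):
    """Logarithmic transformation of r: one half of log((1 + r) / (1 - r))."""
    return 0.5 * (math.log(1.0 + r) - math.log(1.0 - r))


def r_from_z(z):
    """Inverse transformation: r is the hyperbolic tangent of z."""
    return math.tanh(z)


if __name__ == "__main__":
    for z in (0.5, 1.0, 2.0):
        print(z, round(r_from_z(z), 4))
    print(round(z_from_r(0.7616), 3))
```

The explicit logarithmic form is kept in `z_from_r` to mirror the definition; `math.atanh(r)` gives the same value.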

5. Symbols employed in the measurement of errors.

σ: the standard deviation of a parent population.

σM or σx̄: the standard error of a mean, derived from a knowledge
of the σ of the population.

s: the standard deviation of a sample.

sM or sx̄: the estimated standard error of a mean, in the deriva-
tion of which s is used as an approximation to σ.

T: the deviation of a given statistical measurement from
the mean of a normal distribution, expressed in units
of the standard deviation of that distribution; a normal
deviate.

t: the deviation of a given statistical measurement from a
hypothetical value, expressed in units of the estimated
standard error of the measurement in question.

σms: the standard error of the mean of a stratified sample.

D: a difference between two means.

σD: the standard error of the difference between two means.

Dp: a difference between two percentages.

σDp: the standard error of the difference between two per-
centages.

Dz: the difference between two logarithmic transformations
of the coefficient of correlation.

σDz: the standard error of Dz.

σ, with any subscript, is used to represent the standard
error of the measure to which the subscript relates.

P.E. with any subscript is used to represent the prob-
able error of the measure to which the subscript re-
lates (P.E. = .67449σ).

σb1−b2: the standard error of the difference between two coeffi-
cients of regression.


6. Symbols employed in the analysis of variance.

z: the difference between the natural logarithms of two
standard deviations.

σz: the standard error of z.

n1: the number of degrees of freedom in the larger of two
variances being compared.

n2: the number of degrees of freedom in the smaller of two
variances being compared.

p: the probability of a successful outcome of a given
event.

q: the probability of an unsuccessful outcome of a given
event.

n: the number of independent events in a given trial.

χ²: a quantity used in testing hypotheses involving the
computation of theoretical frequencies; defines the
relative magnitude of the differences between observed
and theoretical frequencies.


Greek Alphabet

Letters   Names        Letters   Names        Letters   Names

Α α       Alpha        Ι ι       Iota         Ρ ρ       Rho
Β β       Beta         Κ κ       Kappa        Σ σ       Sigma
Γ γ       Gamma        Λ λ       Lambda       Τ τ       Tau
Δ δ       Delta        Μ μ       Mu           Υ υ       Upsilon
Ε ε       Epsilon      Ν ν       Nu           Φ φ       Phi
Ζ ζ       Zeta         Ξ ξ       Xi           Χ χ       Chi
Η η       Eta          Ο ο       Omicron      Ψ ψ       Psi
Θ θ       Theta        Π π       Pi           Ω ω       Omega



Appendix Table I 

Areas of the Normal Curve of Error in Terms of Abscissa 


        .00     .01     .02     .03     .04     .05     .06     .07     .08     .09

0.0   .00000  .00399  .00798  .01197  .01595  .01994  .02392  .02790  .03188  .03586
0.1   .03983  .04380  .04776  .05172  .05567  .05962  .06356  .06749  .07142  .07535
0.2   .07926  .08317  .08706  .09095  .09483  .09871  .10257  .10642  .11026  .11409
0.3   .11791  .12172  .12552  .12930  .13307  .13683  .14058  .14431  .14803  .15173
0.4   .15542  .15910  .16276  .16640  .17003  .17364  .17724  .18082  .18439  .18793

0.5   .19146  .19497  .19847  .20194  .20540  .20884  .21226  .21566  .21904  .22240
0.6   .22575  .22907  .23237  .23565  .23891  .24215  .24537  .24857  .25175  .25490
0.7   .25804  .26115  .26424  .26730  .27035  .27337  .27637  .27935  .28230  .28524
0.8   .28814  .29103  .29389  .29673  .29955  .30234  .30511  .30785  .31057  .31327
0.9   .31594  .31859  .32121  .32381  .32639  .32894  .33147  .33398  .33646  .33891

1.0   .34134  .34375  .34614  .34850  .35083  .35314  .35543  .35769  .35993  .36214
1.1   .36433  .36650  .36864  .37076  .37286  .37493  .37698  .37900  .38100  .38298
1.2   .38493  .38686  .38877  .39065  .39251  .39435  .39617  .39796  .39973  .40147
1.3   .40320  .40490  .40658  .40824  .40988  .41149  .41309  .41466  .41621  .41774
1.4   .41924  .42073  .42220  .42364  .42507  .42647  .42786  .42922  .43056  .43189

1.5   .43319  .43448  .43574  .43699  .43822  .43943  .44062  .44179  .44295  .44408
1.6   .44520  .44630  .44738  .44845  .44950  .45053  .45154  .45254  .45352  .45449
1.7   .45543  .45637  .45728  .45818  .45907  .45994  .46080  .46164  .46246  .46327
1.8   .46407  .46485  .46562  .46638  .46712  .46784  .46856  .46926  .46995  .47062
1.9   .47128  .47193  .47257  .47320  .47381  .47441  .47500  .47558  .47615  .47670

2.0   .47725  .47778  .47831  .47882  .47932  .47982  .48030  .48077  .48124  .48169
2.1   .48214  .48257  .48300  .48341  .48382  .48422  .48461  .48500  .48537  .48574
2.2   .48610  .48645  .48679  .48713  .48745  .48778  .48809  .48840  .48870  .48899
2.3   .48928  .48956  .48983  .49010  .49036  .49061  .49086  .49111  .49134  .49158
2.4   .49180  .49202  .49224  .49245  .49266  .49286  .49305  .49324  .49343  .49361

2.5   .49379  .49396  .49413  .49430  .49446  .49461  .49477  .49492  .49506  .49520
2.6   .49534  .49547  .49560  .49573  .49585  .49598  .49609  .49621  .49632  .49643
2.7   .49653  .49664  .49674  .49683  .49693  .49702  .49711  .49720  .49728  .49736
2.8   .49744  .49752  .49760  .49767  .49774  .49781  .49788  .49795  .49801  .49807
2.9   .49813  .49819  .49825  .49831  .49836  .49841  .49846  .49851  .49856  .49861

3.0   .49865  .49869  .49874  .49878  .49882  .49886  .49889  .49893  .49897  .49900
3.1   .49903  .49906  .49910  .49913  .49916  .49918  .49921  .49924  .49926  .49929
3.2   .49931  .49934  .49936  .49938  .49940  .49942  .49944  .49946  .49948  .49950
3.3   .49952  .49953  .49955  .49957  .49958  .49960  .49961  .49962  .49964  .49965
3.4   .49966  .49968  .49969  .49970  .49971  .49972  .49973  .49974  .49975  .49976
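
Entries of this kind, the area under the normal curve between the mean and a given abscissa, can be reproduced from the error function. The sketch below uses only the standard library; the function name is illustrative.

```python
import math


def area_from_mean(x_over_sigma):
    """Area under the unit normal curve between the mean and x/sigma."""
    return 0.5 * math.erf(x_over_sigma / math.sqrt(2.0))


if __name__ == "__main__":
    for abscissa in (0.04, 1.00, 1.96, 2.58):
        print(abscissa, round(area_from_mean(abscissa), 5))
```

Doubling the area gives the proportion of cases lying within that distance of the mean on either side.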





Appendix Table II ¹

Table of t

  n      P = .05       .02         .01

  1      12.706      31.821      63.657
  2       4.303       6.965       9.925
  3       3.182       4.541       5.841
  4       2.776       3.747       4.604
  5       2.571       3.365       4.032
  6       2.447       3.143       3.707
  7       2.365       2.998       3.499
  8       2.306       2.896       3.355
  9       2.262       2.821       3.250
 10       2.228       2.764       3.169
 11       2.201       2.718       3.106
 12       2.179       2.681       3.055
 13       2.160       2.650       3.012
 14       2.145       2.624       2.977
 15       2.131       2.602       2.947
 16       2.120       2.583       2.921
 17       2.110       2.567       2.898
 18       2.101       2.552       2.878
 19       2.093       2.539       2.861
 20       2.086       2.528       2.845
 21       2.080       2.518       2.831
 22       2.074       2.508       2.819
 23       2.069       2.500       2.807
 24       2.064       2.492       2.797
 25       2.060       2.485       2.787
 26       2.056       2.479       2.779
 27       2.052       2.473       2.771
 28       2.048       2.467       2.763
 29       2.045       2.462       2.756
 30       2.042       2.457       2.750
  ∞       1.95996     2.32634     2.57582


¹ Excerpts from Table IV, R. A. Fisher, Statistical Methods for Research
Workers. These excerpts are printed here through the courtesy of Dr. Fisher
and his publishers, Oliver and Boyd, of Edinburgh.






Appendix Table III ¹

Values of the Correlation Coefficient for Different Levels of
Significance

.9998766
.990000
.95873
.91720
.8745
.8343
.7977
.7640
.7348
.7079


For a total correlation, n is 2 less than the number of pairs in the 
sample; for a partial correlation, the number of eliminated variates also 
should be subtracted. 
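
A tabled value of r at a given level of significance is connected with the corresponding value of t by r = t / √(t² + n), where n is the number of degrees of freedom described in the note above. The sketch below applies this relation to the P = .01 column of Appendix Table II; the ten values listed above appear to correspond to that column for n = 1 to 10, which is offered as an observation and not as a statement from the source. The function name is illustrative.

```python
import math


def critical_r(t, n):
    """Value of r corresponding to a given t with n degrees of freedom."""
    return t / math.sqrt(t * t + n)


if __name__ == "__main__":
    # t at P = .01 for n = 1, 5 and 10, read from Appendix Table II.
    for n, t in ((1, 63.657), (5, 4.032), (10, 3.169)):
        print(n, round(critical_r(t, n), 4))
```

The relation follows from t = r √n / √(1 − r²), solved for r.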

¹ Excerpts from Table V-A, R. A. Fisher, Statistical Methods for Research
Workers. These excerpts are printed here through the courtesy of Dr. Fisher
and his publishers, Oliver and Boyd, of Edinburgh.

Appendix Table IV


Showing the Relations between r and z for Values of z from 0 to 5 ¹

 z      .00    .01    .02    .03    .04    .05    .06    .07    .08    .09

 .0    .0000  .0100  .0200  .0300  .0400  .0500  .0599  .0699  .0798  .0898
 .1    .0997  .1096  .1194  .1293  .1391  .1489  .1587  .1684  .1781  .1878
 .2    .1974  .2070  .2165  .2260  .2355  .2449  .2543  .2636  .2729  .2821
 .3    .2913  .3004  .3095  .3185  .3275  .3364  .3452  .3540  .3627  .3714
 .4    .3800  .3885  .3969  .4053  .4136  .4219  .4301  .4382  .4462  .4542
 .5    .4621  .4700  .4777  .4854  .4930  .5005  .5080  .5154  .5227  .5299
 .6    .5370  .5441  .5511  .5581  .5649  .5717  .5784  .5850  .5915  .5980
 .7    .6044  .6107  .6169  .6231  .6291  .6352  .6411  .6469  .6527  .6584
 .8    .6640  .6696  .6751  .6805  .6858  .6911  .6963  .7014  .7064  .7114
 .9    .7163  .7211  .7259  .7306  .7352  .7398  .7443  .7487  .7531  .7574
1.0    .7616  .7658  .7699  .7739  .7779  .7818  .7857  .7895  .7932  .7969
1.1    .8005  .8041  .8076  .8110  .8144  .8178  .8210  .8243  .8275  .8306
1.2    .8337  .8367  .8397  .8426  .8455  .8483  .8511  .8538  .8565  .8591
1.3    .8617  .8643  .8668  .8693  .8717  .8741  .8764  .8787  .8810  .8832
1.4    .8854  .8875  .8896  .8917  .8937  .8957  .8977  .8996  .9015  .9033
1.5    .9052  .9069  .9087  .9104  .9121  .9138  .9154  .9170  .9186  .9202
1.6    .9217  .9232  .9246  .9261  .9275  .9289  .9302  .9316  .9329  .9342
1.7    .9354  .9367  .9379  .9391  .9402  .9414  .9426  .9436  .9447  .9458
1.8    .9468  .9478  .9488  .9498  .9508  .9518  .9527  .9536  .9545  .9554
1.9    .9562  .9571  .9579  .9587  .9595  .9603  .9611  .9619  .9626  .9633
2.0    .9640  .9647  .9654  .9661  .9668  .9674  .9680  .9687  .9693  .9699
2.1    .9705  .9710  .9716  .9722  .9727  .9732  .9738  .9743  .9748  .9753
2.2    .9757  .9762  .9767  .9771  .9776  .9780  .9785  .9789  .9793  .9797
2.3    .9801  .9806  .9809  .9812  .9816  .9820  .9823  .9827  .9830  .9834
2.4    .9837  .9840  .9843  .9846  .9849  .9852  .9855  .9858  .9861  .9863
2.5    .9866  .9869  .9871  .9874  .9876  .9879  .9881  .9884  .9886  .9888
2.6    .9890  .9892  .9895  .9897  .9899  .9901  .9903  .9905  .9906  .9908
2.7    .9910  .9912  .9914  .9916  .9917  .9919  .9920  .9922  .9923  .9925
2.8    .9926  .9928  .9929  .9931  .9932  .9933  .9936  .9936  .9937  .9938
2.9    .9940  .9941  .9942  .9943  .9944  .9946  .9946  .9947  .9949  .9950

3.0    .9951
4.0    .9993
5.0    .9999

¹ The figures in the body of the table are values of r corresponding to
z-values read from the scales on the left and top of the table.

Appendix Table V ¹


Table of 


n 

P == .99 

.95 

.50 

.10 

.05 

.02 

.01 

1 

.000157 

.00393 .455 

2.706 

3.841 

5.412 

6.635 

2 

.0201 

.103 

1.386 

4.605 

5.991 

7.824 

9.210 

3 

.115 

.352 

2.366 

6.251 

7.815 

9 . a 37 

11.341 

4 

.297 

.711 

3.357 

7.779 

9.488 

11.668 

13.277 

5 

: .554 ■ 

1.145 

4.351 

9.236 

11.070 

13.388 

15.086 

6 

.872 

1.635 

5.348 

10.645 

12.592 

15.033 

10.812 

7 

1,239 

2.167 

6.346 

12.017 

14,067 

16.622 

18.475 

8 

1.646 

2.733 

7.344 

13.362 

15.507 

18. 168 

20.090 

9 

2.088 

3.325 

8.343 

14.684 

16.919 

19. 679 

21.666 

10 

2.558 

3.940 

9.342 

15.987 

18.307 

21.161 

23.209 

11 

3.053 

4.575 

10.341 

17.275 

19.675 

22. 618 

24.725 

12 

3.571 

5.226 

11.340 

18.549 

21.026 

24.054 

26.217 

13 

4.107 

5.892 

12.340 

19.812 

22.362 

25.472 

27.688 

14 

4.660 

6.571 

13.339 

21.064 

23.685 

26.873 

29.141 

15 

5.229 

7.261 

14.339 

22.307 

24.996 

28:259 

30.578 

16 

5.812 

7.962 

15.338 

23.542 

26.296 

29.633 

32.000 

17 

6.408 

8.672 

16.338 

24.769 

27.587 

30.995 

33.409 

18 

7.015 

9.390 

17.338 

25.989 

28.869 

32.346 

34.805 

19 

7.633 

10.117 

18.338 

27.204 

30.144 

33.687 

36.191 

20 

8.260 

10.851 

19.337 

28.412 

31.^0 

35.020 

37.566 

21 

8.897 

11,591 

20.337 

29.615 

32.671 

36.343 

38.932 

22 

9.542 

12.338 

21.337 

30.813 

33.924 

37.659 

40.289 

23 

10.196 

13.091 

22.337 

32.007 

35.172 

38.968 

41.638 

24 

10.856 

13.848 

23.337 

33.196 

36.415 

40.270 

42,980 

25 

11,524 

14.611 

24.337 

34.382 

37.652 

41.566 

44.314 

26 

12.198 

15.379 

25.336 

35.563 

38.885 

42.856 

45. -642 

27 

12.879 

16.151 

26,336 

36.741 

40.113 

44.140 

46.963 

28 

13.565 

16.928 

27.336 

37.916 

41.337 

45.419 

48.278 

29 

14.256 

17.708 

28.336 

39.087 

42.557 

46.693 

49.588 

30 

14.953 

18.493 

29.336 

40.256 

43 . 7 ^ 

47.962 

50.892 

For larger values of n, the expression $\sqrt{2\chi^2} - \sqrt{2n - 1}$ may be used
as a normal deviate with unit standard error.
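The approximation stated in the note above can be sketched in a few lines of Python. This is only an illustration: the function name `chi_square_normal_deviate` is an assumption introduced here, and the sample figures are the tabled 5 per cent point for n = 30.

```python
import math


def chi_square_normal_deviate(chi_square: float, n: int) -> float:
    """Return sqrt(2 * chi-square) - sqrt(2n - 1).

    For larger n this quantity is treated as a normal deviate with
    unit standard error (hypothetical helper, not from the book).
    """
    return math.sqrt(2.0 * chi_square) - math.sqrt(2.0 * n - 1.0)


# Tabled value of chi-square for n = 30 at P = .05.
deviate = chi_square_normal_deviate(43.773, 30)
print(round(deviate, 3))
```

For this entry the deviate comes out near 1.68, against 1.645 for the exact one-tailed 5 per cent point of the normal curve, which suggests the approximation is already serviceable at the edge of the table.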


¹ Excerpts from Table III, R. A. Fisher, Statistical Methods for Research
Workers. These excerpts are printed here through the courtesy of Dr. Fisher 
and his publishers, Oliver and Boyd, of Edinburgh. 
Appendix Table VI ¹


1 Per Cent Points of the Distribution of z 



                                  Values of n1

 n2       1.      2.      3.      4.      5.      6.      8.     12.     24.      ∞

  1   4.1535  4.2585  4.2974  4.3175  4.3297  4.3379  4.3482  4.3585  4.3689  4.3794
  2   2.2950  2.2976  2.2984  2.2988  2.2991  2.2992  2.2994  2.2997  2.2999  2.3001
  3   1.7649  1.7140  1.6915  1.6786  1.6703  1.6645  1.6569  1.6489  1.6404  1.6314
  4   1.5270  1.4452  1.4075  1.3856  1.3711  1.3609  1.3473  1.3327  1.3170  1.3000
  5   1.3943  1.2929  1.2449  1.2164  1.1974  1.1838  1.1656  1.1457  1.1239  1.0997
  6   1.3103  1.1955  1.1401  1.1068  1.0843  1.0680  1.0460  1.0218   .9948   .9643
  7   1.2526  1.1281  1.0672  1.0300  1.0048   .9864   .9614   .9336   .9020   .8658
  8   1.2106  1.0787  1.0135   .9734   .9459   .9259   .8983   .8673   .8319   .7904
  9   1.1786  1.0411   .9724   .9299   .9006   .8791   .8494   .8157   .7769   .7305
 10   1.1535  1.0114   .9399   .8954   .8646   .8419   .8104   .7744   .7324   .6816
 11   1.1333   .9874   .9136   .8674   .8354   .8116   .7785   .7406   .6958   .6408
 12   1.1166   .9677   .8919   .8443   .8111   .7864   .7520   .7122   .6649   .6061
 13   1.1027   .9511   .8737   .8248   .7907   .7652   .7295   .6882   .6386   .5761
 14   1.0909   .9370   .8581   .8082   .7732   .7471   .7103   .6675   .6159   .5500
 15   1.0807   .9249   .8448   .7939   .7582   .7314   .6937   .6496   .5961   .5269
 16   1.0719   .9144   .8331   .7814   .7450   .7177   .6791   .6339   .5786   .5064
 17   1.0641   .9051   .8229   .7705   .7335   .7057   .6663   .6199   .5630   .4879
 18   1.0572   .8970   .8138   .7607   .7232   .6950   .6549   .6075   .5491   .4712
 19   1.0511   .8897   .8057   .7521   .7140   .6854   .6447   .5964   .5366   .4560
 20   1.0457   .8831   .7985   .7443   .7058   .6768   .6355   .5864   .5253   .4421
 21   1.0408   .8772   .7920   .7372   .6984   .6690   .6272   .5773   .5150   .4294
 22   1.0363   .8719   .7860   .7309   .6916   .6620   .6196   .5691   .5056   .4176
 23   1.0322   .8670   .7806   .7251   .6856   .6555   .6127   .5615   .4969   .4068
 24   1.0285   .8626   .7757   .7197   .6799   .6496   .6064   .5545   .4890   .3967
 25   1.0251   .8585   .7712   .7148   .6747   .6442   .6006   .5481   .4816   .3872
 26   1.0220   .8548   .7670   .7103   .6699   .6392   .5952   .5422   .4748   .3784
 27   1.0191   .8513   .7631   .7062   .6655   .6346   .5902   .5367   .4685   .3701
 28   1.0164   .8481   .7595   .7023   .6614   .6303   .5856   .5316   .4626   .3624
 29   1.0139   .8451   .7562   .6987   .6576   .6263   .5813   .5269   .4570   .3550
 30   1.0116   .8423   .7531   .6954   .6540   .6226   .5773   .5224   .4519   .3481
 60    .9784   .8025   .7086   .6472   .6028   .5687   .5189   .4574   .3746   .2352
  ∞    .9462   .7636   .6651   .5999   .5522   .5152   .4604   .3908   .2913   0


¹ From Table VI, R. A. Fisher, Statistical Methods for Research Workers.
This table is printed here through the courtesy of Dr. Fisher and his
publishers, Oliver and Boyd, of Edinburgh.



Appendix Table VII ¹


5 Per Cent Points of the Distribution of z 



                                  Values of n1

 n2       1.      2.      3.      4.      5.      6.      8.     12.     24.      ∞

  1   2.5421  2.6479  2.6870  2.7071  2.7194  2.7276  2.7380  2.7484  2.7588  2.7693
  2   1.4592  1.4722  1.4765  1.4787  1.4800  1.4808  1.4819  1.4830  1.4840  1.4851
  3   1.1577  1.1284  1.1137  1.1051  1.0994  1.0953  1.0899  1.0842  1.0781  1.0716
  4   1.0212   .9690   .9429   .9272   .9168   .9093   .8993   .8885   .8767   .8639
  5    .9441   .8777   .8441   .8236   .8097   .7997   .7862   .7714   .7550   .7368
  6    .8948   .8188   .7798   .7558   .7394   .7274   .7112   .6931   .6729   .6499
  7    .8606   .7777   .7347   .7080   .6896   .6761   .6576   .6369   .6134   .5862
  8    .8355   .7475   .7014   .6726   .6526   .6378   .6175   .5945   .5682   .5371
  9    .8163   .7242   .6757   .6450   .6238   .6080   .5862   .5613   .5324   .4979
 10    .8012   .7058   .6553   .6232   .6009   .5843   .5611   .5346   .5036   .4657
 11    .7889   .6909   .6387   .6056   .5822   .5648   .5406   .5126   .4796   .4387
 12    .7788   .6786   .6250   .5907   .5666   .5487   .5234   .4941   .4592   .4156
 13    .7703   .6682   .6134   .5783   .5535   .5350   .5089   .4785   .4419   .3957
 14    .7630   .6594   .6036   .5677   .5423   .5233   .4964   .4649   .4269   .3782
 15    .7568   .6518   .5950   .5586   .5326   .5131   .4856   .4532   .4138   .3628
 16    .7514   .6451   .5876   .5505   .5241   .5042   .4760   .4428   .4022   .3490
 17    .7466   .6393   .5811   .5434   .5166   .4964   .4676   .4337   .3919   .3366
 18    .7424   .6341   .5753   .5371   .5099   .4894   .4602   .4255   .3827   .3253
 19    .7386   .6296   .5701   .5315   .5040   .4832   .4535   .4182   .3743   .3151
 20    .7352   .6254   .5654   .5265   .4986   .4776   .4474   .4116   .3668   .3057
 21    .7322   .6216   .5612   .5219   .4938   .4725   .4420   .4056   .3599   .2971
 22    .7294   .6182   .5574   .5178   .4894   .4679   .4370   .4001   .3536   .2892
 23    .7269   .6151   .5540   .5140   .4854   .4636   .4325   .3950   .3478   .2818
 24    .7246   .6123   .5508   .5106   .4817   .4598   .4283   .3904   .3426   .2749
 25    .7225   .6097   .5478   .5074   .4783   .4562   .4244   .3862   .3376   .2686
 26    .7206   .6073   .5451   .5045   .4752   .4529   .4209   .3823   .3330   .2625
 27    .7187   .6051   .5427   .5017   .4723   .4499   .4176   .3786   .3287   .2569
 28    .7171   .6030   .5403   .4992   .4696   .4471   .4146   .3752   .3248   .2516
 29    .7156   .6011   .5382   .4969   .4671   .4444   .4117   .3720   .3211   .2468
 30    .7141   .5994   .5362   .4947   .4648   .4420   .4090   .3691   .3176   .2419
 60    .6933   .5738   .5073   .4632   .4311   .4064   .3701   .3255   .2654   .1644
  ∞    .6729   .5486   .4787   .4319   .3974   .3706   .3309   .2804   .2085   0


¹ From Table VI, R. A. Fisher, Statistical Methods for Research Workers.
This table is printed here through the courtesy of Dr. Fisher and his
publishers, Oliver and Boyd, of Edinburgh.





Appendix Table VIII 


Squares of the Natural Numbers from 100 to 999 


                                              SQUARE OF
  n          n     n + 1     n + 2     n + 3     n + 4     n + 5     n + 6     n + 7     n + 8     n + 9

100    1 00 00   1 02 01   1 04 04   1 06 09   1 08 16   1 10 25   1 12 36   1 14 49   1 16 64   1 18 81
110    1 21 00   1 23 21   1 25 44   1 27 69   1 29 96   1 32 25   1 34 56   1 36 89   1 39 24   1 41 61
120    1 44 00   1 46 41   1 48 84   1 51 29   1 53 76   1 56 25   1 58 76   1 61 29   1 63 84   1 66 41
130    1 69 00   1 71 61   1 74 24   1 76 89   1 79 56   1 82 25   1 84 96   1 87 69   1 90 44   1 93 21
140    1 96 00   1 98 81   2 01 64   2 04 49   2 07 36   2 10 25   2 13 16   2 16 09   2 19 04   2 22 01
150    2 25 00   2 28 01   2 31 04   2 34 09   2 37 16   2 40 25   2 43 36   2 46 49   2 49 64   2 52 81
160    2 56 00   2 59 21   2 62 44   2 65 69   2 68 96   2 72 25   2 75 56   2 78 89   2 82 24   2 85 61
170    2 89 00   2 92 41   2 95 84   2 99 29   3 02 76   3 06 25   3 09 76   3 13 29   3 16 84   3 20 41
180    3 24 00   3 27 61   3 31 24   3 34 89   3 38 56   3 42 25   3 45 96   3 49 69   3 53 44   3 57 21
190    3 61 00   3 64 81   3 68 64   3 72 49   3 76 36   3 80 25   3 84 16   3 88 09   3 92 04   3 96 01
200    4 00 00   4 04 01   4 08 04   4 12 09   4 16 16   4 20 25   4 24 36   4 28 49   4 32 64   4 36 81
210    4 41 00   4 45 21   4 49 44   4 53 69   4 57 96   4 62 25   4 66 56   4 70 89   4 75 24   4 79 61
220    4 84 00   4 88 41   4 92 84   4 97 29   5 01 76   5 06 25   5 10 76   5 15 29   5 19 84   5 24 41
230    5 29 00   5 33 61   5 38 24   5 42 89   5 47 56   5 52 25   5 56 96   5 61 69   5 66 44   5 71 21
240    5 76 00   5 80 81   5 85 64   5 90 49   5 95 36   6 00 25   6 05 16   6 10 09   6 15 04   6 20 01
250    6 25 00   6 30 01   6 35 04   6 40 09   6 45 16   6 50 25   6 55 36   6 60 49   6 65 64   6 70 81
260    6 76 00   6 81 21   6 86 44   6 91 69   6 96 96   7 02 25   7 07 56   7 12 89   7 18 24   7 23 61
270    7 29 00   7 34 41   7 39 84   7 45 29   7 50 76   7 56 25   7 61 76   7 67 29   7 72 84   7 78 41
280    7 84 00   7 89 61   7 95 24   8 00 89   8 06 56   8 12 25   8 17 96   8 23 69   8 29 44   8 35 21
290    8 41 00   8 46 81   8 52 64   8 58 49   8 64 36   8 70 25   8 76 16   8 82 09   8 88 04   8 94 01
300    9 00 00   9 06 01   9 12 04   9 18 09   9 24 16   9 30 25   9 36 36   9 42 49   9 48 64   9 54 81
310    9 61 00   9 67 21   9 73 44   9 79 69   9 85 96   9 92 25   9 98 56  10 04 89  10 11 24  10 17 61
320   10 24 00  10 30 41  10 36 84  10 43 29  10 49 76  10 56 25  10 62 76  10 69 29  10 75 84  10 82 41
330   10 89 00  10 95 61  11 02 24  11 08 89  11 15 56  11 22 25  11 28 96  11 35 69  11 42 44  11 49 21
340   11 56 00  11 62 81  11 69 64  11 76 49  11 83 36  11 90 25  11 97 16  12 04 09  12 11 04  12 18 01
350   12 25 00  12 32 01  12 39 04  12 46 09  12 53 16  12 60 25  12 67 36  12 74 49  12 81 64  12 88 81
360   12 96 00  13 03 21  13 10 44  13 17 69  13 24 96  13 32 25  13 39 56  13 46 89  13 54 24  13 61 61
370   13 69 00  13 76 41  13 83 84  13 91 29  13 98 76  14 06 25  14 13 76  14 21 29  14 28 84  14 36 41
380   14 44 00  14 51 61  14 59 24  14 66 89  14 74 56  14 82 25  14 89 96  14 97 69  15 05 44  15 13 21
390   15 21 00  15 28 81  15 36 64  15 44 49  15 52 36  15 60 25  15 68 16  15 76 09  15 84 04  15 92 01
400   16 00 00  16 08 01  16 16 04  16 24 09  16 32 16  16 40 25  16 48 36  16 56 49  16 64 64  16 72 81
410   16 81 00  16 89 21  16 97 44  17 05 69  17 13 96  17 22 25  17 30 56  17 38 89  17 47 24  17 55 61
420   17 64 00  17 72 41  17 80 84  17 89 29  17 97 76  18 06 25  18 14 76  18 23 29  18 31 84  18 40 41
430   18 49 00  18 57 61  18 66 24  18 74 89  18 83 56  18 92 25  19 00 96  19 09 69  19 18 44  19 27 21
440   19 36 00  19 44 81  19 53 64  19 62 49  19 71 36  19 80 25  19 89 16  19 98 09  20 07 04  20 16 01
450   20 25 00  20 34 01  20 43 04  20 52 09  20 61 16  20 70 25  20 79 36  20 88 49  20 97 64  21 06 81
460   21 16 00  21 25 21  21 34 44  21 43 69  21 52 96  21 62 25  21 71 56  21 80 89  21 90 24  21 99 61
470   22 09 00  22 18 41  22 27 84  22 37 29  22 46 76  22 56 25  22 65 76  22 75 29  22 84 84  22 94 41
480   23 04 00  23 13 61  23 23 24  23 32 89  23 42 56  23 52 25  23 61 96  23 71 69  23 81 44  23 91 21
490   24 01 00  24 10 81  24 20 64  24 30 49  24 40 36  24 50 25  24 60 16  24 70 09  24 80 04  24 90 01
500   25 00 00  25 10 01  25 20 04  25 30 09  25 40 16  25 50 25  25 60 36  25 70 49  25 80 64  25 90 81
510   26 01 00  26 11 21  26 21 44  26 31 69  26 41 96  26 52 25  26 62 56  26 72 89  26 83 24  26 93 61
520   27 04 00  27 14 41  27 24 84  27 35 29  27 45 76  27 56 25  27 66 76  27 77 29  27 87 84  27 98 41
530   28 09 00  28 19 61  28 30 24  28 40 89  28 51 56  28 62 25  28 72 96  28 83 69  28 94 44  29 05 21
540   29 16 00  29 26 81  29 37 64  29 48 49  29 59 36  29 70 25  29 81 16  29 92 09  30 03 04  30 14 01



Appendix Table VIII, Continued
Squares of the Natural Numbers from 100 to 999

                                              SQUARE OF
  n          n     n + 1     n + 2     n + 3     n + 4     n + 5     n + 6     n + 7     n + 8     n + 9

550   30 25 00  30 36 01  30 47 04  30 58 09  30 69 16  30 80 25  30 91 36  31 02 49  31 13 64  31 24 81
560   31 36 00  31 47 21  31 58 44  31 69 69  31 80 96  31 92 25  32 03 56  32 14 89  32 26 24  32 37 61
570   32 49 00  32 60 41  32 71 84  32 83 29  32 94 76  33 06 25  33 17 76  33 29 29  33 40 84  33 52 41
580   33 64 00  33 75 61  33 87 24  33 98 89  34 10 56  34 22 25  34 33 96  34 45 69  34 57 44  34 69 21
590   34 81 00  34 92 81  35 04 64  35 16 49  35 28 36  35 40 25  35 52 16  35 64 09  35 76 04  35 88 01
600   36 00 00  36 12 01  36 24 04  36 36 09  36 48 16  36 60 25  36 72 36  36 84 49  36 96 64  37 08 81
610   37 21 00  37 33 21  37 45 44  37 57 69  37 69 96  37 82 25  37 94 56  38 06 89  38 19 24  38 31 61
620   38 44 00  38 56 41  38 68 84  38 81 29  38 93 76  39 06 25  39 18 76  39 31 29  39 43 84  39 56 41
630   39 69 00  39 81 61  39 94 24  40 06 89  40 19 56  40 32 25  40 44 96  40 57 69  40 70 44  40 83 21
640   40 96 00  41 08 81  41 21 64  41 34 49  41 47 36  41 60 25  41 73 16  41 86 09  41 99 04  42 12 01
650   42 25 00  42 38 01  42 51 04  42 64 09  42 77 16  42 90 25  43 03 36  43 16 49  43 29 64  43 42 81
660   43 56 00  43 69 21  43 82 44  43 95 69  44 08 96  44 22 25  44 35 56  44 48 89  44 62 24  44 75 61
670   44 89 00  45 02 41  45 15 84  45 29 29  45 42 76  45 56 25  45 69 76  45 83 29  45 96 84  46 10 41
680   46 24 00  46 37 61  46 51 24  46 64 89  46 78 56  46 92 25  47 05 96  47 19 69  47 33 44  47 47 21
690   47 61 00  47 74 81  47 88 64  48 02 49  48 16 36  48 30 25  48 44 16  48 58 09  48 72 04  48 86 01
700   49 00 00  49 14 01  49 28 04  49 42 09  49 56 16  49 70 25  49 84 36  49 98 49  50 12 64  50 26 81
710   50 41 00  50 55 21  50 69 44  50 83 69  50 97 96  51 12 25  51 26 56  51 40 89  51 55 24  51 69 61
720   51 84 00  51 98 41  52 12 84  52 27 29  52 41 76  52 56 25  52 70 76  52 85 29  52 99 84  53 14 41
730   53 29 00  53 43 61  53 58 24  53 72 89  53 87 56  54 02 25  54 16 96  54 31 69  54 46 44  54 61 21
740   54 76 00  54 90 81  55 05 64  55 20 49  55 35 36  55 50 25  55 65 16  55 80 09  55 95 04  56 10 01
750   56 25 00  56 40 01  56 55 04  56 70 09  56 85 16  57 00 25  57 15 36  57 30 49  57 45 64  57 60 81
760   57 76 00  57 91 21  58 06 44  58 21 69  58 36 96  58 52 25  58 67 56  58 82 89  58 98 24  59 13 61
770   59 29 00  59 44 41  59 59 84  59 75 29  59 90 76  60 06 25  60 21 76  60 37 29  60 52 84  60 68 41
780   60 84 00  60 99 61  61 15 24  61 30 89  61 46 56  61 62 25  61 77 96  61 93 69  62 09 44  62 25 21
790   62 41 00  62 56 81  62 72 64  62 88 49  63 04 36  63 20 25  63 36 16  63 52 09  63 68 04  63 84 01
800   64 00 00  64 16 01  64 32 04  64 48 09  64 64 16  64 80 25  64 96 36  65 12 49  65 28 64  65 44 81
810   65 61 00  65 77 21  65 93 44  66 09 69  66 25 96  66 42 25  66 58 56  66 74 89  66 91 24  67 07 61
820   67 24 00  67 40 41  67 56 84  67 73 29  67 89 76  68 06 25  68 22 76  68 39 29  68 55 84  68 72 41
830   68 89 00  69 05 61  69 22 24  69 38 89  69 55 56  69 72 25  69 88 96  70 05 69  70 22 44  70 39 21
840   70 56 00  70 72 81  70 89 64  71 06 49  71 23 36  71 40 25  71 57 16  71 74 09  71 91 04  72 08 01
850   72 25 00  72 42 01  72 59 04  72 76 09  72 93 16  73 10 25  73 27 36  73 44 49  73 61 64  73 78 81
860   73 96 00  74 13 21  74 30 44  74 47 69  74 64 96  74 82 25  74 99 56  75 16 89  75 34 24  75 51 61
870   75 69 00  75 86 41  76 03 84  76 21 29  76 38 76  76 56 25  76 73 76  76 91 29  77 08 84  77 26 41
880   77 44 00  77 61 61  77 79 24  77 96 89  78 14 56  78 32 25  78 49 96  78 67 69  78 85 44  79 03 21
890   79 21 00  79 38 81  79 56 64  79 74 49  79 92 36  80 10 25  80 28 16  80 46 09  80 64 04  80 82 01
900   81 00 00  81 18 01  81 36 04  81 54 09  81 72 16  81 90 25  82 08 36  82 26 49  82 44 64  82 62 81
910   82 81 00  82 99 21  83 17 44  83 35 69  83 53 96  83 72 25  83 90 56  84 08 89  84 27 24  84 45 61
920   84 64 00  84 82 41  85 00 84  85 19 29  85 37 76  85 56 25  85 74 76  85 93 29  86 11 84  86 30 41
930   86 49 00  86 67 61  86 86 24  87 04 89  87 23 56  87 42 25  87 60 96  87 79 69  87 98 44  88 17 21
940   88 36 00  88 54 81  88 73 64  88 92 49  89 11 36  89 30 25  89 49 16  89 68 09  89 87 04  90 06 01
950   90 25 00  90 44 01  90 63 04  90 82 09  91 01 16  91 20 25  91 39 36  91 58 49  91 77 64  91 96 81
960   92 16 00  92 35 21  92 54 44  92 73 69  92 92 96  93 12 25  93 31 56  93 50 89  93 70 24  93 89 61
970   94 09 00  94 28 41  94 47 84  94 67 29  94 86 76  95 06 25  95 25 76  95 45 29  95 64 84  95 84 41
980   96 04 00  96 23 61  96 43 24  96 62 89  96 82 56  97 02 25  97 21 96  97 41 69  97 61 44  97 81 21
990   98 01 00  98 20 81  98 40 64  98 60 49  98 80 36  99 00 25  99 20 16  99 40 09  99 60 04  99 80 01
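Any entry of Appendix Table VIII can be checked by direct computation. The following is a minimal sketch, assuming the table's habit of printing each square in groups of two digits; the helper names `grouped_square` and `squares_row` are illustrative only.

```python
def grouped_square(n: int) -> str:
    """Return n squared, printed in two-digit groups counted from the right."""
    digits = str(n * n)
    groups = []
    while digits:
        groups.append(digits[-2:])   # peel off the last two digits
        digits = digits[:-2]
    return " ".join(reversed(groups))


def squares_row(base: int) -> list[str]:
    """Return the ten entries of the row headed by `base` (n through n + 9)."""
    return [grouped_square(base + k) for k in range(10)]


# Print the row of the table that begins at n = 540.
print(540, "  ".join(squares_row(540)))
```

A row produced this way can be laid beside the printed row to catch a misread digit.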



Appendix Table IX 


Sums of the First Three Powers of the Natural Numbers from 1 to 50 


 n    Σ(n)    Σ(n²)      Σ(n³)       n     Σ(n)    Σ(n²)        Σ(n³)

 1       1        1          1      26      351    6 201      123 201
 2       3        5          9      27      378    6 930      142 884
 3       6       14         36      28      406    7 714      164 836
 4      10       30        100      29      435    8 555      189 225
 5      15       55        225      30      465    9 455      216 225
 6      21       91        441      31      496   10 416      246 016
 7      28      140        784      32      528   11 440      278 784
 8      36      204      1 296      33      561   12 529      314 721
 9      45      285      2 025      34      595   13 685      354 025
10      55      385      3 025      35      630   14 910      396 900
11      66      506      4 356      36      666   16 206      443 556
12      78      650      6 084      37      703   17 575      494 209
13      91      819      8 281      38      741   19 019      549 081
14     105    1 015     11 025      39      780   20 540      608 400
15     120    1 240     14 400      40      820   22 140      672 400
16     136    1 496     18 496      41      861   23 821      741 321
17     153    1 785     23 409      42      903   25 585      815 409
18     171    2 109     29 241      43      946   27 434      894 916
19     190    2 470     36 100      44      990   29 370      980 100
20     210    2 870     44 100      45    1 035   31 395    1 071 225
21     231    3 311     53 361      46    1 081   33 511    1 168 561
22     253    3 795     64 009      47    1 128   35 720    1 272 384
23     276    4 324     76 176      48    1 176   38 024    1 382 976
24     300    4 900     90 000      49    1 225   40 425    1 500 625
25     325    5 525    105 625      50    1 275   42 925    1 625 625
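The sums in Appendix Table IX follow from the standard closed forms for the first three power sums. The sketch below is an illustration under that assumption, not the method by which the table was compiled; the function name `power_sums` is hypothetical.

```python
def power_sums(n: int) -> tuple[int, int, int]:
    """Return the sums of k, k**2 and k**3 for k = 1..n via closed forms."""
    s1 = n * (n + 1) // 2                  # sum of the natural numbers
    s2 = n * (n + 1) * (2 * n + 1) // 6    # sum of their squares
    s3 = s1 * s1                           # sum of cubes = square of s1
    return s1, s2, s3


# Cross-check the closed forms against direct summation for n = 1..50.
for n in range(1, 51):
    direct = tuple(sum(k ** p for k in range(1, n + 1)) for p in (1, 2, 3))
    assert power_sums(n) == direct

print(power_sums(50))
```

The identity used for the third column, that the sum of cubes equals the square of the sum of the numbers themselves, explains why every entry in that column is a perfect square.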



Appendix Table X
Five-Place Logarithms of Numbers


100-150


  N        0     1     2     3     4     5     6     7     8     9

100   00 000   043   087   130   173   217   260   303   346   389
101      432   475   518   561   604   647   689   732   775   817
102      860   903   945   988  *030  *072  *115  *157  *199  *242
103   01 284   326   368   410   452   494   536   578   620   662
104      703   745   787   828   870   912   953   995  *036  *078
105   02 119   160   202   243   284   325   366   407   449   490
106      531   572   612   653   694   735   776   816   857   898
107      938   979  *019  *060  *100  *141  *181  *222  *262  *302
108   03 342   383   423   463   503   543   583   623   663   703
109      743   782   822   862   902   941   981  *021  *060  *100
110   04 139   179   218   258   297   336   376   415   454   493
111      532   571   610   650   689   727   766   805   844   883
112      922   961   999  *038  *077  *115  *154  *192  *231  *269
113   05 308   346   385   423   461   500   538   576   614   652
114      690   729   767   805   843   881   918   956   994  *032
115   06 070   108   145   183   221   258   296   333   371   408
116      446   483   521   558   595   633   670   707   744   781
117      819   856   893   930   967  *004  *041  *078  *115  *151
118   07 188   225   262   298   335   372   408   445   482   518
119      555   591   628   664   700   737   773   809   846   882


*171 * a 07 *243 


428 461 494 628 661 694 


Prop. Pts.

       44    43    42         41    40    39
 1    4.4   4.3   4.2        4.1   4.0   3.9
 2    8.8   8.6   8.4        8.2   8.0   7.8
 3   13.2  12.9  12.6       12.3  12.0  11.7
 4   17.6  17.2  16.8       16.4  16.0  15.6
 5   22.0  21.5  21.0       20.5  20.0  19.5
 6   26.4  25.8  25.2       24.6  24.0  23.4
 7   30.8  30.1  29.4       28.7  28.0  27.3
 8   35.2  34.4  33.6       32.8  32.0  31.2
 9   39.6  38.7  37.8       36.9  36.0  35.1

       38    37    36         32    31    30
 1    3.8   3.7   3.6        3.2   3.1   3.0
 2    7.6   7.4   7.2        6.4   6.2   6.0
 3   11.4  11.1  10.8        9.6   9.3   9.0
 4   15.2  14.8  14.4       12.8  12.4  12.0
 5   19.0  18.5  18.0       16.0  15.5  15.0
 6   22.8  22.2  21.6       19.2  18.6  18.0
 7   26.6  25.9  25.2       22.4  21.7  21.0
 8   30.4  29.6  28.8       25.6  24.8  24.0
 9   34.2  33.3  32.4       28.8  27.9  27.0


726 764 782 8 U S 40 869 

























Appendix Table X 
















Appendix Table X 

Five-Place Logarithms of Numbers 


200-250


  N        0     1     2     3     4     5     6     7     8     9

200   30 103   125   146   168   190   211   233   255   276   298
201      320   341   363   384   406   428   449   471   492   514
202      535   557   578   600   621   643   664   685   707   728
203      750   771   792   814   835   856   878   899   920   942
204      963   984  *006  *027  *048  *069  *091  *112  *133  *154
205   31 175   197   218   239   260   281   302   323   345   366
206      387   408   429   450   471   492   513   534   555   576
207      597   618   639   660   681   702   723   744   765   785
208      806   827   848   869   890   911   931   952   973   994
209   32 015   035   056   077   098   118   139   160   181   201
210      222   243   263   284   305   325   346   366   387   408
211      428   449   469   490   510   531   552   572   593   613
212      634   654   675   695   715   736   756   777   797   818
213      838   858   879   899   919   940   960   980  *001  *021
214   33 041   062   082   102   122   143   163   183   203   224
215      244   264   284   304   325   345   365   385   405   425
216      445   465   486   506   526   546   566   586   606   626
217      646   666   686   706   726   746   766   786   806   826
218      846   866   885   905   925   945   965   985  *005  *025
219   34 044   064   084   104   124   143   163   183   203   223
220      242   262   282   301   321   341   361   380   400   420
221      439   459   479   498   518   537   557   577   596   616
222      635   655   674   694   713   733   753   772   792   811
223      830   850   869   889   908   928   947   967   986  *005
224   35 025   044   064   083   102   122   141   160   180   199
225      218   238   257   276   295   315   334   353   372   392
226      411   430   449   468   488   507   526   545   564   583
227      603   622   641   660   679   698   717   736   755   774
228      793   813   832   851   870   889   908   927   946   965
229      984  *003  *021  *040  *059  *078  *097  *116  *135  *154
230   36 173   192   211   229   248   267   286   305   324   342
231      361   380   399   418   436   455   474   493   511   530
232      549   568   586   605   624   642   661   680   698   717
233      736   754   773   791   810   829   847   866   884   903
234      922   940   959   977   996  *014  *033  *051  *070  *088
235   37 107   125   144   162   181   199   218   236   254   273
236      291   310   328   346   365   383   401   420   438   457
237      475   493   511   530   548   566   585   603   621   639
238      658   676   694   712   731   749   767   785   803   822
239      840   858   876   894   912   931   949   967   985  *003
240   38 021   039   057   075   093   112   130   148   166   184
241      202   220   238   256   274   292   310   328   346   364
242      382   399   417   435   453   471   489   507   525   543
243      561   578   596   614   632   650   668   686   703   721
244      739   757   775   792   810   828   846   863   881   899
245      917   934   952   970   987  *005  *023  *041  *058  *076
246   39 094   111   129   146   164   182   199   217   235   252
247      270   287   305   322   340   358   375   393   410   428
248      445   463   480   498   515   533   550   568   585   602
249      620   637   655   672   690   707   724   742   759   777
250      794   811   829   846   863   881   898   915   933   950

  N        0     1     2     3     4     5     6     7     8     9

Prop. Pts.

       22    21    20    19    18    17
 1    2.2   2.1   2.0   1.9   1.8   1.7
 2    4.4   4.2   4.0   3.8   3.6   3.4
 3    6.6   6.3   6.0   5.7   5.4   5.1
 4    8.8   8.4   8.0   7.6   7.2   6.8
 5   11.0  10.5  10.0   9.5   9.0   8.5
 6   13.2  12.6  12.0  11.4  10.8  10.2
 7   15.4  14.7  14.0  13.3  12.6  11.9
 8   17.6  16.8  16.0  15.2  14.4  13.6
 9   19.8  18.9  18.0  17.1  16.2  15.3
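The proportional parts printed with Table X serve for interpolating a fifth significant figure. The sketch below shows the arithmetic under stated assumptions: mantissas are handled as five-digit integers, the proportional part is the tabular difference times the extra digit over ten, and the function name `interpolate_mantissa` is introduced here purely for illustration.

```python
def interpolate_mantissa(lower: int, upper: int, fifth_digit: int) -> int:
    """Interpolate a five-place mantissa between two adjacent table entries.

    lower, upper: tabled mantissas (as integers) for consecutive
                  four-figure arguments, e.g. 2003 and 2004.
    fifth_digit:  the extra significant figure (0 to 9) of the number.
    """
    difference = upper - lower                      # tabular difference
    proportional_part = difference * fifth_digit / 10
    return lower + round(proportional_part)


# log 2.0037: entries for 2003 and 2004 are 30 168 and 30 190,
# so the difference is 22 and the part for 7 is 15.4.
mantissa = interpolate_mantissa(30168, 30190, 7)
print(f"0.{mantissa:05d}")
```

The same lookup is done by eye in the "Prop. Pts." column: find the block headed by the tabular difference and read the line for the fifth figure.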



Appendix Table X 

Five-Place Logarithms of Numbers 

250-300


39 794 1 811 I 829 i 846 
967 
40140 
312 
483 


664 
824 
993 

258 I 41 162 

259 330 


915 933 950 


614 631 647 


824 840 855 



Prop. Pts.

       18    17    16    15    14
 1    1.8   1.7   1.6   1.5   1.4
 2    3.6   3.4   3.2   3.0   2.8
 3    5.4   5.1   4.8   4.5   4.2
 4    7.2   6.8   6.4   6.0   5.6
 5    9.0   8.5   8.0   7.5   7.0
 6   10.8  10.2   9.6   9.0   8.4
 7   12.6  11.9  11.2  10.5   9.8
 8   14.4  13.6  12.8  12.0  11.2
 9   16.2  15.3  14.4  13.5  12.6

















Appendix Table X 


Five-Place Logarithms of Numbers 


300-350


.49715 































Appendix Table X 
Five-Place Logarithms of Numbers 

















Appendix Table X 



Five-Place Logarithms of Numbers 

450-500













72 016 


73 


Appendix Table X
Five-Place Logarithms of Numbers

500-550


500 

501 

502 

503

504

505

506 

507 
Deming, W. E., 469, 470; Chi-square
test, 629

Dennis, Samuel J., 218

Dependent variable, see Variable
Depreciation, 79 L 
Derivative, partial, p3|| 643 
Equation of regression, see Regression
Error, normal curve of, see Normal 
law of error 

Error, sampling, see Standard error 
Estimate, making of, 332 ff., 566 ff.;

zone of, 349, 571 ff., 590 ff. 
Exchange rates, distribution of, 94 
Expected value, 294 
Exponential curve, 19, 28, 258, 266, 
271; modified, 272, 667; see also
Logarithmic curve 
Exponent, logarithmic, 23 
Exports, statistics of, 36, BS
Extrapolation, 264, 277, 675 
Ezekiel, Mordecai, 412, 477, 547, 564, 
652; multiple correlation analysis, 
537 ff.; correction of standard error, 
542; correction of the correlation 
coefficient, 544 

Factor reversal test, 199 
Falkner, Helen D., 288
Farm price index number, 222
Federal Reserve Board, index of pro-
duction, 316 ff.

Fisher, Arne, 448 

Fisher, Irving, 181; time reversal test,
190; weighted index numbers, 195, 
196, 204; factor reversal test, 199; 
ideal index number, 201 
Fisher, R. A., 270, 479, 603; statistical 
population, 456; null hypothesis,
475; analysis of variance, 490 ff.;
z table, 499, 704-5; extension of z
table, 518; t table, 603, 700; sig-
nificance of the correlation coeffi- 
cient, 611; Chi-square table, 625, 
628, 703 

FrarJer, Edward K., 105 
Frequency curve, 41, 82; polygon, 
41,67,^5,88,91,93 
Frequency distribution, 50 ff.; pur-
pose of, 56; comparison of, 86; gen-
eral characteristics of, 97 ff.; see
also Distribution

Frequency, theoretical, and actual, 
431 

Friedman, Milton, 521 
Functional relationship, 12, 389; lin-
ear, see Linear relationship; para-
bolic, see Parabolic relationship 

Galton, Francis, lines of regression,

m 

Gantt, H. L., progress chart, 46

Gauss, Karl Friedrich, normal law of
error, 435

Geometric mean, 125 ff.; definition of,
125; computation of, 126; charac-
teristics of, 127, 135; as measure of
wniml 129; m average 

, 185; of loptrith** 

, ’ iiitc iS4 

, Oeowirii 1% Mf 271* 


Glover, James W., 269, 437 
Gompertz curve, 272, 671 
Goodness of fit, 447; criteria of, 276;
Chi-square test of, 626 ff.
Graphic method, of locating aver-
ages, 120 ff.; in multiple correla-
tion, 564

Graphic presentation, 8 ff.; of fre-
quency distributions, 63; of time 
series, 227 

Grouping of data, 53, 112; ungrouped 
data, 109; effect on mode, 115; in 
correlation tables, 340, 354 
Growth curves, Gompertz, 272, 671; 
modified exponential, 272, 667 ff.;
logistic, 272, 675 ff. 

Hall, Lincoln W., 288 
Harmonic equation, 579 
Harmonic mean, 132 ff.; character-
istics of, 135; of relative prices, 
186; of reciprocal observations, 585, 
587 

Hart, Hornell, reliability of a per- 
centage, 483 

Height distribution, 87, 360 
High contact, of frequency distribu- 
tions, 443 

Histogram, 64 ff.; see also Column
diagram 

Homogeneity, 487; tests for, 120, 
630 ff.; in time series, 301; in sam- 
pling procedure, 462, 607; Chi- 
square test of, 633 ff.

Hotelling, Harold, 378, 479 
Hyperbolic curve, 16, 28, 569 

Ideal index, 201; for the measurement
of production, 307 

Income distribution, statistics of, 71, 
97, 102 

Independence, tests of, 630 ff.
Independent variable, see Variable 
Index numbers, 18; nature of, 161 ff.;
ideal, 201; use of aggregate,
165; of retail price, 2M; of cost of
living, 221; of farm price, 222 ff.; of
seasonal variation, 287 ff.; of in-
dustrial activity, 312 ff., 390, 393;
of stock prices, 390 ff.

Index numbers of production, TO ff.;
unadjusted, ^6; adjusted, 310;
Federal Reserve Board index, 316;
derived from price indices, 319 ff.; 
of industrial productivity, 321
Index numbers of wholesale prices, 
167, 216 ff.; purpose of, 170; con- 
struction of, 180 ff., 208 ff.; aggre-
gative type, 181; arithmetic aver- 
age type, 183; weighted, 196; of 
farm crop prices, 182, 189; geomet- 
ric average type, 185, 198; median 
type, 185; harmonic type, 186; 
comparison of types, 188; time 
reversal test, 190; weighted types, 
193 ff., 198; alternative types, 
204 ff.; commodities to be included 
in, 209 

Index of correlation, see Correlation 
index 

Index of variability, 157 
Induction, statistical, 452 ff., 698 ff.; 
nature of, 453; measures of reliabil- 
ity, 464 ff.; generalizing from small 
samples, 598 ff. 

Industrial change, measurements of, 
322 

Inference, statistical, see Induction 
Interaction, of principles of classifica- 
tion, 688 

Interpolation, 70, 81, 277; for the 
median, 114; for the mode, 118; for 
monthly trend values, 273; in 
Fisher’s z table, 500; double inter- 
polation, 507 

Irrigation, correlated with alfalfa 
yield, 404 ff. 

Jones, D. C., binomial distribution, 
660 

Karsten, Karl G., 278 
Kelley, Truman L., 206; reliability of 
constants, 485 
Kendall, M. G., 629 
Keynes, J. M., random sampling, 461 
Killough, H. B., 569 
Knibbs, Sir George, 214 
Kurtosis, 100, 137, 159 
Kurtz, Edwin, 77 

Lag, in time series analysis, 390 ff.; 
changes in different cycle phases, 
397 


Laspeyre’s index number, 193, 214
Law of large numbers, 455 
Least squares, method of, 246 ff., 
638 ff.; applied to linear relations,
246, 328, 354, 366, 509; applied to 
power curves, 260, 405; applied to 
logarithmic curves, 264 ff.; in cor- 
relation analysis, 366, 373, 405 
Leptokurtic, 449 
Life table, 77 

Line of regression, see Regression 
Linear correlation, see Correlation, 
linear 

Linearity, test for, 423; by variance 
analysis, 508 ff.; see also Linear 
relationship 

Linear relationship, 14, 16, 26, 325 ff.; 
fitting by least-squares, 246 ff.; in 
business series, 257, 268; between 
discount rates, 348; tests for, 423, 
477, 508 

Link relatives, 204 

Logarithmic, equation, 26 ff., 563,
569 ff., 671; mean, 128; see also
Geometric mean; paper, iSl, 227;
deviation, 265; function of the cor- 
relation coefficient, 614 
Logarithms, common, 23 ff., 492, 572; 
use in computing the geometric 
mean, 125, 130; use in curve fitting, 
264 ff., 269; Naperian, 435, 492; 
Appendix table X, 709 
Logistic curve, 272, 675 


Macaulay, F. R., 185, 244 
Malenbaum, Wilfred, 565 
Mantissa, 24 

Manufactured goods, rôle in price 
movements, 213 

Mean, arithmetic, see Arithmetic 
mean; geometric, see Geometric 
mean; harmonic, see Harmonic 
mean 

Mean deviation, 139 ff. 

Mean product, 351, 358 
Measurement of, central tendency, 
see Central tendency; relationship, 
see Relationship, etc. 

Median, definition of, 102; location of, 
109 ff.; computation of, 113; 
graphic location of, 120 ff.; char- 
acteristics of, 134; relation to mean 
deviation, 140; of relative prices, 
185; standard error of, 472 
Merriman, Mansfield, 90 
Mesokurtic, 449 
Minor, J. R., 556 

Mitchell, W. C., 93, 173, 176, 212, 
242, 303; comparison of index num- 
bers, 209; business cycles, 467, 483 
Mode, 96; definition of, 101; location 
of, 115 ff.; graphic location of, 
120 ff.; characteristics of, 135 
Moments, of frequency distributions, 
440; about the mean, 442 
Monthly trend values, 272 ff. 
Mortality tables, 80 
Moving average, 234 ff.; application 
to non-linear series, 239; measure- 
ment of seasonal fluctuations, 
285 ff.; use in correlating cycles, 
398 n. 

Mudgett, Bruce D., 216 
Multiple correlation, see Correlation, 
multiple 

Multiple frequency table, 289 

Napierian logarithm, 435 
National Bureau of Economic Re- 
search, 244, 320, 397; study of in- 
come distribution, 132; construc- 
tion of index numbers, 219; study 
of production change, 309 
National Industrial Conference Board, 
cost of living index, 221 
Natural number, 24, 28; table of 
squares of, 706; sums of powers, 708 
New York Census of Manufactures, 
309, 317, 371 

Non-linear correlation, see Correla- 
tion, non-linear 

Non-linear relationship, 404 ff.; see 
also Parabolic and exponential func- 
tion 

Normal deviate, 437, 599; table of, 
603, 699 

Normal equations, for linear relation- 
ship, 249; parabolic, 254; of multi- 
ple relationship, 537 ff.; deriva- 
tion of, 639; formation of, 640; 
on, 648; Doolittle solution 

Normal law of error, W, 153, 425 ff., 


436; its use, 438; economic appli- 
cation of, 440 ff.; criteria for, 444; 
fitting the normal curve, 445 ff.; 
distribution, 332, 371, 458; de- 
parture from, 374, 378; computa- 
tion of theoretical frequencies, 446; 
generalization of results, 448; of 
the distribution of means, 464; use 
in measures of reliability, 464 ff.; 
area under, 437, 699; test of good- 
ness of fit of, 627 
Null hypothesis, 475 

Ogive, 80 ff., 85 

Organization of data, 51, 82, 100; in 
time series, 226 

Origin, arbitrary, 107, 351; at point 
of averages, 353, 365 
Orthogonal polynomials, 270 

Paasche’s index number, formula for, 
195, 215 

Pabst, Margaret, 378, 479 
Parabolic curve, 16, 21, 27, 270, 577; 

see also Parabolic function 
Parabolic function, fitting of, 253 ff.; 
second degree, 260, 405; logarith- 
mic, 264, 269, 270; testing para- 
bolic hypothesis, 514 ff. 

Parameter, 457 

Pareto, Vilfredo, law of income dis- 
tribution, 132 

Partial correlation, see Correlation, 
partial 

Peake, E. G., 94 

Peakedness, 100; see also Kurtosis 
Pearl, Raymond, 271, 272; formation 
of normal equations, 642; logistic 
curve, 675 

Pearson, Karl, 156, 158, 254, 436; 
coefficient of correlation, 335; cor- 
relation ratio, 413 ff.; curve types, 
448; descriptive measures of fre- 
quency distributions, 448 ff.; statis- 
tical inference, 454; Chi-square dis- 
tribution, 618 ff., 626 
Percentages, difference between and 
significance of, 483 
Percentile, 114 

Periodic fluctuation, 230; removal by 
moving averages, 235; see also 
Seasonal and cyclical variation 
Periodic function, 21 
Persons, Warren M., 204; analysis of 
cycle lags, 390 ff. 

Platykurtic, 449 

Polynomial, orthogonal, 270; see also 
Parabolic function 
Population, statistical, 453, 454, 456 
Potential series, 21 

Power series, 253; see also Parabolic 
function 

Price relative, 162; arithmetic aver- 
age of, 183 

Price, wholesale, 93, 168; index num- 
bers of, 161 ff., 167, 216 ff.; see also 
Index number; price ratios, 171 ff.; 
measurement of change of, 174; 
wholesale groups, 211; index of re- 
tail, 220; of farm products, 222; 
deflation of, 279; measurement of 
variation in, 493 

Probable error, 152, 155; of index 
numbers, 206; see also Standard 
error 

Probability, 603; principles of, 425 ff.; 
addition of, 427; measurement of, 
429; a priori, 431; empirical, 431; 
normal, 439, 459, 471; normal 
table of, 699; integral, 436 
Probability curve, 98; see also Normal 
law of error 
Probable value, 332 
Production, statistics of, 10, 35, 40, 
43, 47; of fuel, 163, 265; of crops, 
192; as measured by index numbers, 
305 ff.; see also List of charts 
Product-moment method, 349 ff., 
368; for classified data, 354 ff. 
Projection, of trend values, 277, 402 
Purposive selection, in sampling pro- 
cedure, 462 

Quartile, 114; graphic location of, 
120 ff.; deviation, 150 ff., 154; 
standard error of, 473 

Random fluctuations, 231; removal 
by moving averages, 241 
Random sampling, 458, 461; see also 
Sampling 

Range, of variation, 139, 154; semi- 
interquartile, 151 


Rank correlation, 374 ff.; see also 
Correlation, rank 

Rate, of interest, 30, 76, 228; of 
change, 40, 267, 278, 587; of ex- 
change, 94; averaging of, 125 
Ratio, chart, 29, 35 
Ratio, correlation, 413; see also Cor- 
relation ratio 

Reciprocals, use in measuring rela- 
tionship, 578 ff., 675 ff. 

Reed, Lowell, J., 272; logistic curve, 
675 

Reference cycles, 243, 262; correla- 
tion of, 382 
Regimen, 214, 322 

Regression, lines of, 359 ff.; use of, 
364 ff., 367, 423, 607; coefficient of 
regression, 359 ff., 363, 479, 561, 
607; for cotton production and 
price, 387, 401; standard error of 
coefficient of regression, 479, 607, 
609 

Relationship, between income and 
auto registration, 326 ff., 352; meas- 
urement of, 325 ff., 334; between 
discount rates, 340 ff.; between 
time series, 380 ff.; temporal, 391 ff.; 
linear, see Linear relationship 
Relative deviations, 129; weighted, 
167 

Relative price, 162; arithmetic aver- 
age of, 183; geometric average, 
185, 198; harmonic average, 187; 
weighted average, 196 
Relative variation, measurement of, 
156 ff., 264 

Reliability, measures of, 464; of the 
mean, 464; of the difference be- 
tween means, 481, 483; of the me- 
dian, 472; of the standard deviation, 
473; of the coefficient of correlation, 
474; index of correlation, 477; coeffi- 
cient of regression, 478 
Residuals, 247 

Residual variability, see Variability, 
residual 

Retail price, index of, 220 
Richardson, A. H., 49 
Rietz, H. L., 143 

Robertson, R. D., 405 

Robinson, G., 88, 274, 465 
Root-mean-square deviation, 146, 
276, 330, 416; see also Standard 
deviation 
Rulon, P. J., 485 

Sample, size of, 117; estimates from, 
146; in constructing index numbers, 
206 

Sampling, problem of, 452 ff., 460; 
random, 458, 461; generalizing from 
small samples, 598 ff.; errors of, 
293, 447; see also Standard error 
Sasuly, Max, 270 
Scale, for curve reading, 39 
Scatter, 99; degree of, 137, 334, 409, 
414, 646; see also Variation 
Scatter diagram, 326, 328, 348, 370, 
416 

Scott, Frances V., 219 
Seasonal variation, 230, 284 ff., 380; 
removal by moving averages, 235; 
measurement of, 287 ff.; adjustment 
of, 317; test of significance of, 522 ff. 
Secular trend, 229, 380, 487; of cotton 
production and price, 383, 385; 
measurement of, 231 ff.; represen- 
tation by moving average, 234; 
by mathematical curves, 244 ff., 
667 ff.; of business series, 257 ff.; 
selection of curve, 274 ff. 

Selection of curve of trend, 274 
Semi-interquartile range, 151 
Semi-logarithmic charts, 28, 264; 
advantages of, 40 

Series, periodic, 21; potential, 21; 
continuous, 75 

Sheppard, W. F., correction for 
grouping, 150, 442 ff.; table of 
normal areas, 436 

Shewhart, W. A., 49; distribution of 
the standard deviation, 600 ff. 

Significance, tests of, 464; see also 
Standard error 

Significant figures, 485 
Sine curve, 21 

Skewness, 96; measures of, 100, 122, 
137, I5f ff., 449; of geometric 
ittritti, 129; of the standard devia- 
tion, 600; of the correlation 
coefficient, 130 

IW; of regression line, W, 
|l§, ^1; see Regression 
coefficient 


Smoothing of curves, 69 ff., 76, 117 
Snedecor, George W., 449, 688 
Snyder, Carl, 229 
Spurr, W. A., 293 

Squares of natural numbers, table of, 
706 

Standard deviation, 145 ff., 330, 
371, 416; characteristic features of, 
155; use in adjusting index num- 
bers, 311, 393, 395; in terms of 
moments, 443; about the means 
of arrays, 418; use of, in variance 
analysis, 491, 494; see also Standard 
error 

Standard error, of the binomial dis- 
tribution, 434, 660; of the mean, 
464, 664; of the difference of 
means, 481, 483; of the median, 
472; of the standard deviation, 473; 
of the correlation coefficient, 474, 
545; of the correlation index, 477; 
of the regression coefficient, 478; of 
the partial correlation coefficient, 
560, 615; of the z function, 493, 
615; limitations of above measures, 
486 ff. 

Standard error of estimate, 330 ff.; 
computation of, 333, 338, 370, 388, 
401, 406, 590; short-cut calculation, 
346, 354; of parabolic functions, 
410; significance of, 349, 371; about 
line of regression, 480; correction 
of, 413, 542; in multiple correla- 
tion analysis, 534, 541 ff.; of loga- 
rithmic functions, 571 ff.; in ratio 
terms, 573; in reciprocal terms, 
581; zones of estimate, 590 ff. 

Starr, G. W., 312 
Statistic, 457 

Statistical description, see Descrip- 
tion 

Statistical induction, see Induction 
Steinmetz, C. P., 256 
Stewart, Ethelbert, 84 
Stock price cycles, relation to busi- 
ness activity, 390, 397 
Straight line, fitting of, 246; see also 
Linear relationship 

Stratification, in sampling procedure, 

m 

Stratified purposive sampling, 463; 
standard error of, 472 
“Student,” standard error of the rank 
correlation coefficient, 479; stand- 
ard error of the mean, 599; distribu- 
tion of the standard deviation, 
600 ff. 

Sturges, H. A., 57 
Symbols, glossary of, 691 ff. 

Symmetry, 100; degree of, 120; see 
also Skewness 

Table, of areas under the normal 
curve, 699; Fisher t table, 603, 
700; of significant values of the cor- 
relation coefficient, 612, 701; of 
relations of the correlation coeffi- 
cient to the z function, 702; of the 
distribution of z, 704-5; of the 
powers of natural numbers, 706, 
708; of common logs, 709 
Tabulation of data, 51, 62; in cor- 
relation tables, 341, 354, 415 
Tendency, central; see Averages and 
Central tendency 
Thompson, F. L., 292 
Time reversal test, 190 

Time series, charts, 33, 18, 50; 

analysis of, 225 ff., 295; graphic 
representation, 227; removal of 
cycles, 234; fitting a line to, 252; 
measurement of seasonal fluctua- 
tion, 284 ff.; of cyclical fluctuation, 
284; measurement of relations be- 
tween, 380 ff.; see also Correlation 
of time series 
Tolley, H. R., 537, 652 
Trend, 262; of price movements, 170; 
of monthly values, 272; selection of 
curve of, 274 ff.; measurement of, 
225 ff.; secular, see Secular trend 

Ungrouped data, 109; product mo- 
ment method for, 352 
Uniformity of nature, principle of, 
457 

Unweighted index number, 184 
U. S. Bureau of Internal Revenue, 
326 

U. S. Bureau of Labor Statistics, 
statistics of fuel production, 164; 
index of wholesale prices, 168, 172, 
176, 212, 216 ff., 282; index number 


used, 193; index of retail prices, 
220; cost of living index, 221 
U. S. Department of Agriculture, in- 
dex of farm prices, 222 

Variability, measures of, 490 ff., 560; 
between classes, 494 ff.; absolute, 
586; residual, 526, 689; see also 
Variance and variation 
Variable, 11; relations between vari- 
ables, 325 ff., 359, 360 
Variance, analysis of, 490 ff.; z test 
of difference in variability, 492, 
506, 513, 517; in testing variability 
between classes, 494; in the meas- 
urement of relationship, 501 ff., 
519; in testing linearity, 508 ff.; 
curvilinear hypothesis, 514 ff.; test- 
ing seasonal fluctuation, 522 ff.; 
in testing the multiple correlation 
coefficient, 545; in testing signifi- 
cance of principles of classification, 
681 ff. 

Variation, 97; measures of, 99, 137 ff., 
330; absolute, 138; comparison of 
measures of, 153, 155; measures of 
difference in, 490 ff.; coefficient of, 
156; in price relatives, 171 ff.; 
within and between arrays, 502; 
see also Seasonal and cyclical fluc- 
tuation 

Verhulst, P. F., 272 

Wage, statistics of, 96, 103, 105, 111, 
124 

Wahr, George, 437 

Walsh, C. M., 130, 201; ratio variabil- 
ity, 596 

Weighted average, 104, 106; of rela- 
tive prices, 106; geometric, 125; 
moving average, 244 
Weldon, W. F. R., dice experiment, 
432, 618 

Wheat, exports of, 33; yield corre- 
lated with fertilizer, 415 
Whipple, G. C., 87 
Whittaker, E. T., 88, 274, 465 
Wholesale price, 211 ff.; index of, 
216 ff.; see also Price 
Working, Holbrook, harmonic mean, 
S8i 
Yates, F., 688 

Yule, G. U., 60, 418; Chi-square fre- 
quencies, 622, 629 

Zone, of estimate; see Estimate, Dis- 
persion 

z test of variability, 492, 506, 514, 
517; tables of, 704-6; standard error 
of, 615 

z transformation of correlation coeffi- 
cient, 613 ff., 702 



